Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization
Chih Lee, University of Connecticut - Storrs, Chihlee@Engr.Uconn.Edu

University of Connecticut OpenCommons@UConn

Doctoral Dissertations University of Connecticut Graduate School

1-17-2014


Recommended Citation
Lee, Chih, "Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization" (2014). Doctoral Dissertations. 304. https://opencommons.uconn.edu/dissertations/304

Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization

Chih Lee, Ph.D. University of Connecticut, 2014

ABSTRACT

A transcription factor (TF) is a protein or protein complex. It regulates the expression of its target genes by physically binding to the regulatory regions of these genes. The binding sites of a TF naturally share a common pattern or motif with one another. Given known binding sites of a TF, a TF model can be built to scan sequences for putative binding sites. This is known as a transcription factor binding site (TFBS) search problem. In this dissertation, we investigate the TFBS search problem using machine learning approaches.

In general, the known binding sites of a TF are of variable lengths and have to be aligned before a model can be built. Transcription factor binding site alignment is considered an unsupervised learning problem since no other information about the unaligned binding sites is given. We propose an algorithm that considers the lengths of TFBSs and dependencies of nucleotide positions in a binding site. The novel method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association).

Studies often utilize TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. The analysis often involves TF model collection, promoter sequence retrieval and visualization, requiring several tools to accomplish. To accelerate TFBS analyses, we developed a novel integrated webtool named LASAGNA-Search. This user-friendly tool allows users to perform the analysis without leaving the site.

TFBS search methods are considered supervised learning algorithms since they learn from example binding sites of a TF. Most of the TFBS search methods consider only known binding sites of a TF and hence deal with one-class classification problems. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. We propose two novel methods named the negative-to-positive vector and the optimal discriminating vector methods, utilizing both binding sites and non-binding sites.

Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization

Chih Lee

M.Sc. Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2005

A Dissertation Submitted in Partial Fulﬁllment of the Requirements for the Degree of Doctor of Philosophy at the University of Connecticut

2014

Copyright by

Chih Lee

2014

APPROVAL PAGE

Doctor of Philosophy Dissertation

Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization

Presented by Chih Lee, M.Sc. CSIE

Major Advisor Chun-Hsi Huang

Associate Advisor Jinbo Bi

Associate Advisor Sanguthevar Rajasekaran

Associate Advisor Daniel Schwartz

Associate Advisor Dong-Guk Shin

University of Connecticut 2014

TABLE OF CONTENTS

Introduction: Machine Learning and Computational TFBS Analysis
0.1. Supervised and Unsupervised Learning
0.2. Computational TFBS Analysis
0.2.1. Transcription Factor Binding Site Alignment
0.2.2. Transcription Factor Binding Site Search
0.2.3. Chapter Organization

Ch. 1: LASAGNA
1.1. Background
1.2. Methods
1.2.1. The Search Module
1.2.2. The LASAGNA Algorithm
1.2.3. LASAGNA for ChIP-seq Data
1.2.4. Scoring a Putative Binding Site
1.3. Results and Discussion
1.3.1. Comparison of Alignment Algorithms
1.3.2. Comparison of TFBS Search Methods
1.3.3. Application of LASAGNA-ChIP to ChIP-seq Data
1.3.4. LASAGNA is Simple and Effective
1.4. Conclusions

Ch. 2: LASAGNA-Search
2.1. Background
2.2. Materials and Methods
2.2.1. Modules
2.2.2. TF Model Collections
2.3. Results and Discussion
2.3.1. User Interface
2.3.2. Comparison of Features to Existing Webtools
2.3.3. Evaluation of Precomputed TF Models
2.4. Future Directions

Ch. 3: Searching for TFBSs in Vector Spaces
3.1. Background
3.2. Methods
3.2.1. Data Sets
3.2.2. Notation
3.2.3. Embedding Short Sequences in Vector Spaces
3.2.4. Searching for TFBSs in Vector Spaces
3.2.5. The NPV Method
3.2.6. The ODV Method
3.2.7. The PSSM and ULPB Methods
3.3. Results and Discussion
3.3.1. Performance Assessment and Evaluation Metrics
3.3.2. Prokaryotic Transcription Factor Binding Sites
3.3.3. Eukaryotic Transcription Factor Binding Sites
3.3.4. Motif Subtype Identification in Vector Spaces
3.3.5. Independent Validation on ChIP-seq Data
3.4. Conclusions

Bibliography

LIST OF FIGURES

1.2.1 An Illustration of LASAGNA with Ka = 0
1.2.2 LASAGNA-ChIP Flowchart
1.3.1 Overall ROC Curves for the Three Alignment Algorithms
1.3.2 Distribution of Ks by Species and Conserved Domain (LASAGNA)
1.3.3 Distribution of Ks by Species and Conserved Domain (ClustalW2)
1.3.4 Distribution of Ks by Species and Conserved Domain (MEME)
1.3.5 Comparison of the PSSM Method Dependent on LASAGNA to SiTaR

2.2.1 Architecture of LASAGNA-Search
2.2.2 Comparison of Scoring Strategies Using TF Models Collected by LASAGNA-Search
2.2.3 An Inferred Gene Regulatory Network of Human Genes TP53 and MYB
2.3.1 Input Page of LASAGNA-Search
2.3.2 Result Page of LASAGNA-Search
2.3.3 Visualization of Hits in the UCSC Genome Browser
2.3.4 Comparison of Precomputed TF Models by Average Precision
2.3.5 Comparison of Precomputed TF Models by Accuracy

3.2.1 Illustration of Embedding a Short Sequence in Vector Space
3.2.2 Illustration of the NPV Method
3.2.3 The Orthogonal Projection of s onto t
3.3.1 Comparison of the PSSM, ULPB, NPV and ODV Methods on the RegulonDB Data Set
3.3.2 Comparison of the PSSM, ULPB, NPV and ODV Methods on the JASPAR Data Set
3.3.3 Illustration of the kNPV Method
3.3.4 The kNPV (kODV) Method Versus the NPV (ODV) Method

LIST OF TABLES

1.3.1 TFBSs in TRANSFAC Public Database by Species
1.3.2 Species-wise and Overall Comparisons between LASAGNA and ClustalW2
1.3.3 Comparison of Two Groups of TFs Divided According to Results on LASAGNA and ClustalW2
1.3.4 Species-wise and Overall Comparisons between LASAGNA and MEME
1.3.5 Comparison of Two Groups of TFs Divided According to Results on LASAGNA and MEME
1.3.6 Distribution of the 1751 Binding Sites of 90 TFs in TRANSFAC Public Database
1.3.7 List of 38 ChIP-seq Experiments
1.3.8 Sequence Logos of Motifs Found by LASAGNA-ChIP and MEME

2.2.9 Summary of TF Model Collections
2.3.10
2.3.11 Human ENCODE Tracks Used for Validating TF Models
2.3.12 Mouse ENCODE Tracks Used for Validating TF Models
2.3.13 Performance of LASAGNA-Search TF Models Validated by ENCODE Human ChIP-seq Data
2.3.14 Performance of LASAGNA-Search TF Models Validated by ENCODE Mouse ChIP-seq Data
2.3.15 Performance of MAPPER2 TF Models Validated by ENCODE Human ChIP-seq Data
2.3.16 Performance of MAPPER2 TF Models Validated by ENCODE Mouse ChIP-seq Data

3.2.17 Statistics of the E. coli TFs in RegulonDB
3.2.18 Statistics of TFs in the JASPAR Database
3.3.19 Results of Independent Validation on ChIP-seq Data

Introduction: Machine Learning and Computational Transcription Factor Binding Site Analysis

0.1 Supervised and Unsupervised Learning

Machine learning problems in general fall into two categories, supervised learning and unsupervised learning. The classification problem is a supervised learning problem. In a classification problem, each data instance is tagged with a class label. The labeled instances “supervise” the learning process, which produces a classifier. The classifier is able to distinguish one class from another and hence can be used to classify a data instance whose class is unknown. A data instance in this case can be a set of variables describing a person, while the class labels are normal and cancer. The trained classifier can predict if a person has cancer based on these variables. The clustering problem, on the other hand, is an unsupervised learning problem. In this case, no data instance is tagged with a class label. A clustering algorithm divides data instances into groups based on the similarity between pairs of data instances. One example is to group cancer patients so as to identify cancer subtypes.

Embedding data instances in $\mathbb{R}^p$ is usually a first step in machine learning, where p is the number of variables describing an instance. For some data, an instance is readily described by p real variables. Microarray data [75] is one example, where each array is described by the expression of p genes. Oftentimes, biological data instances are not readily described by p real variables and hence conversion is needed to embed each instance in $\mathbb{R}^p$. For instance, to manipulate sequence data in the Euclidean space, each sequence needs to be embedded in $\mathbb{R}^p$. Once embedded in $\mathbb{R}^p$, it is often desired to manipulate the instances in a lower dimensional subspace $\mathbb{R}^k$, where k < p. A subspace can be found such that certain requirements are satisfied. Principal component analysis (PCA) [39], an unsupervised learning approach, finds a subspace such that a desired portion of variance in the data is retained in the subspace while pairwise similarity between data instances is approximately preserved. Previous studies showed success of PCA in population structure analysis [46], ethnicity inference [52] and haplogroup inference [88]. With additional information, data instances can be categorized into 2 or more classes. In this case, data instances can be placed in a subspace found by Fisher’s discriminant analysis (FDA), which can be viewed as a supervised learning method. In this subspace, class centers are separated as far from one another as possible while class members are pulled as close toward their respective class centers as possible. It was demonstrated that FDA can be used to visually identify the best similarity measure between two short DNA sequences [48] in searching for transcription factor binding sites.
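As a concrete illustration of embedding sequence data in a Euclidean space, the following minimal sketch maps an l-mer to a vector with p = 4l coordinates, one indicator per position and nucleotide. The one-hot encoding and the helper names `embed` and `euclidean` are assumptions made for illustration only; the embedding actually used in this dissertation is described in Chapter 3 and may differ.

```python
import math

NUCLEOTIDES = "ACGT"

def embed(lmer):
    """Map an l-mer to a 4*l vector (assumed one-hot encoding, illustration only)."""
    vector = []
    for letter in lmer:
        # One indicator coordinate per nucleotide at this position.
        vector.extend(1.0 if letter == u else 0.0 for u in NUCLEOTIDES)
    return vector

def euclidean(x, y):
    """Euclidean distance between two embedded instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = embed("ACGT")
y = embed("ACGA")
print(len(x))           # p = 4 * 4 = 16 coordinates
print(euclidean(x, y))  # one mismatch gives distance sqrt(2)
```

Under this particular encoding, the squared Euclidean distance between two l-mers is twice their Hamming distance, so geometric operations in the embedded space retain a direct sequence-level interpretation.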

PCA and FDA systematically find a subspace such that certain requirements are satisfied. Domain knowledge, however, can also be utilized to find subspaces. Lee and Huang [49] approached the transcription factor (TF) binding site (TFBS) search problem by first embedding short sequences of length l, or l-mers, in a vector space, where l is the length of TFBSs. TFBSs can then be searched for in subspaces identified by domain knowledge. Specifically, these are promising subspaces implied by previous studies. Similar to information retrieval [1], a query vector is used to search for TFBSs. Two supervised approaches to constructing a query vector were proposed. One is named the negative-to-positive vector (NPV) method. The other is named the optimal discriminating vector (ODV) method. NPV and ODV are supervised methods because they learn from known TFBSs and non-TFBSs. It was shown that, in this framework, the best subspace can be identified for each individual TF. This important advantage contributes to the superior performance of the NPV and ODV methods over a state-of-the-art method named ULPB [69].

0.2 Computational TFBS Analysis

A transcription factor is a protein that regulates the expression of its target genes by physically binding to the regulatory regions of these genes. The binding sites of a transcription factor (TF) naturally share similarity with one another. The common pattern shared among the binding sites of a TF is called a motif. In general, there are two approaches to computational motif analysis, de novo motif discovery [3, 85, 5, 11, 76, 79, 66, 4, 55, 91, 89, 29] and transcription factor binding site (TFBS) search [63, 13, 34, 69, 24]. As the name suggests, de novo motif discovery algorithms find over-represented patterns in sequences without prior knowledge of the binding TFs. The input to these algorithms is usually the upstream region sequences of genes putatively co-regulated by one or more common TFs. The output is one or more motifs or patterns whose instances are over-represented in the input sequences. On the other hand, a TFBS search algorithm takes binding site sequences of a TF as input. It learns from these known binding sites and builds a TF model out of them. The TF model can then be used to scan sequences for putative binding sites. While the two approaches are tightly connected, we focus on the TFBS search problem and assume that a TF has known binding sites available.

0.2.1 Transcription Factor Binding Site Alignment

A typical TFBS search algorithm requires aligned TFBSs. This requirement allows simple representations of TF models. Types of TF models include consensus sequences, position-specific scoring matrices (PSSMs) [77], etc. The PSSM method is a widely used method among the available TFBS search algorithms. Given aligned binding sites of a TF, the TF model is essentially a 4 × l matrix, where l is the length of the binding sites. Column i of the matrix stores the scores of matching the ith letter in a sequence of length l (an l-mer) to nucleotides A, C, G and T, respectively. The score of an l-mer is then calculated by summing up the scores of letter 1 through letter l. Once constructed, the matrix of a TF can be stored in a database to scan sequences for binding sites of the TF in the future without resorting to the actual binding sites. In fact, many tools [74, 40, 71, 13, 83, 90, 41] depend on matrices stored in at least one of the databases, JASPAR [10], RegulonDB [27] and TRANSFAC [59]. Since a matrix is constructed from aligned binding sites, we cannot overemphasize the quality of TFBS alignments.
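The 4 × l matrix and the additive scoring rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring scheme of any particular tool: the log-odds scores, the pseudocount of 0.5, the uniform background of 0.25, and the function names `build_pssm`, `score` and `scan` are all hypothetical choices.

```python
import math

NUCLEOTIDES = "ACGT"

def build_pssm(sites, pseudocount=0.5, background=0.25):
    """Build a 4 x l matrix of log-odds scores from aligned, gap-free sites.

    Pseudocount and uniform background are assumptions for illustration.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(NUCLEOTIDES)
    matrix = []
    for i in range(length):
        column = [site[i] for site in sites]
        matrix.append({
            u: math.log(((column.count(u) + pseudocount) / total) / background)
            for u in NUCLEOTIDES
        })
    return matrix

def score(matrix, lmer):
    """Sum the scores of letter 1 through letter l."""
    return sum(matrix[i][u] for i, u in enumerate(lmer))

def scan(matrix, sequence):
    """Score every l-mer of a sequence; returns (position, score) pairs."""
    l = len(matrix)
    return [(pos, score(matrix, sequence[pos:pos + l]))
            for pos in range(len(sequence) - l + 1)]

sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pssm = build_pssm(sites)
hits = scan(pssm, "GGTATAATCC")
best = max(hits, key=lambda hit: hit[1])
print(best[0])  # 0-based start of the highest scoring 6-mer
```

Because the matrix alone suffices for scanning, it can be stored and reused without the original binding sites, which is why the quality of the underlying alignment matters so much.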

Databases such as JASPAR, TRANSFAC and ORegAnno [32] contain DNA segments bound by TFs. These DNA segments can be seen as TFBSs with some irrelevant bases on one or both sides because of the resolutions of techniques used to obtain TFBSs. The DNA segments belonging to a TF are therefore unaligned variable-length sequences. While the DNA segments for most TFs in the JASPAR database are aligned, this is not the case for the TRANSFAC Public and ORegAnno databases. About 53% (983 out of 1867) of the TFs in the TRANSFAC Public database (release 7.0) have unaligned variable-length DNA segments. Moreover, nearly 78% (1447 out of 1867) of TFs having curated DNA segments do not have a matrix. Focusing on TFs with variable-length DNA segments, about 71% (669 out of 983) of them do not have a matrix. On the other hand, the ORegAnno database stores experimentally validated DNA segments bound by TFs but does not provide matrices. About 31% (175 out of 572) of the TFs therein have variable-length DNA segments. In the absence of a matrix, to search for binding sites of these TFs using a matrix-dependent tool, one needs to first align the curated DNA segments for each TF. In the rest of this dissertation, we refer to (variable-length) DNA segments containing TFBSs as (variable-length) TFBSs for simplicity.

ChIP-seq data represents another important source of TFBSs. ChIP-seq [38] is a high-throughput technique used to determine the in vivo binding affinities of a transcription factor to DNA on the whole-genome scale. The raw data from a ChIP-seq experiment is typically processed by a peak-finding algorithm to identify DNA segments containing binding signal peaks. These variable-length DNA segments can be seen as TFBSs with excessive irrelevant bases on both sides. To locate the actual TFBS within each DNA segment, a de novo motif discovery tool can be used to identify the motifs present in those DNA segments.

0.2.2 Transcription Factor Binding Site Search

One assumption the PSSM representation makes is that positions in a binding site are independent, which is often not the case. Osada et al. [63] exploited dependence between positions by considering nucleotide pairs in scoring methods. It was shown that incorporating nucleotide pairs significantly improved the performance of a method, meaning that most transcription factors studied demonstrated correlation between positions in a binding site. This result was reinforced in a recent study [69], in which the authors showed correlations between two nucleotides within a binding site by plotting the mutual information matrix. A novel scoring method called the ungapped likelihood under positional background (ULPB) method was proposed in this study. The ULPB method models a TFBS by two first-order Markov chains and scores a candidate binding site by the likelihood ratio produced by the two Markov chains. Leave-one-out cross-validation results on 22 TFs with 20 or more binding sites showed that ULPB is superior to the methods compared in their work.

The PSSM method and similar methods consider only known binding sites of a TF, while information contained in non-binding sites is not exploited. This is because explicit use of negative examples in the TFBS search problem is hindered by the vast number of non-binding sites of a transcription factor. This is further aggravated by the low specificity of some transcription factors, where a binding site may be more similar to a non-binding site than to some other binding sites. Owing to these issues, previous studies involving negative examples are limited and the roles of negative examples remain unclear. In a review article, Hannenhalli [34] surveyed work on improved motif models and integrative methods. None of these reviewed studies [34], however, investigated the use of negative examples on top of true TFBSs. While introducing improved benchmarks for computational motif discovery, Sandve et al. [72] described algorithms for finding optimal motif models using both positive and negative TFBSs. Three models were compared using the proposed benchmarks. However, no methods relying on only positive examples were compared. In a previous study, Do and Wang [18] formulated the TFBS search problem as a classification problem, proposed a novel similarity measure, and investigated three classification techniques. Five-fold CV results showed that learning vector quantization performed better than P-Match [13], which requires only positive examples. The evaluation, however, was done on only 8 human transcription factors and 8 artificial ones. It is not clear how the results on the small set of 8 real TFs can be generalized to other TFs. Lee and Huang [48] proposed to visualize TFBSs in the context of negative examples. It was shown that, using negative examples explicitly, the visualization technique affords identification of a better similarity measure between short sequences in a TFBS search problem.

0.2.3 Chapter Organization

In this dissertation, we investigate transcription factor binding site search and visualization using machine learning techniques. Typically, a TFBS search algorithm cannot process and learn from unaligned binding sites. Therefore, the TFBS search problem consists of two sub-problems. One is aligning the known variable-length binding sites of a TF, while the other is searching for novel binding sites given the aligned TFBSs. TFBS alignment can be seen as an unsupervised learning problem since no other information about the unaligned TFBSs is given. In Chapter 1, we describe an algorithm that considers the lengths of TFBSs and iteratively applies a TFBS search method to scan unaligned TFBSs [50, 51, 47]. That is, a supervised learning algorithm is used at each iteration to deal with this unsupervised learning problem. The method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association). We show that LASAGNA significantly outperforms two widely used algorithms for TFBS alignment. Coupling LASAGNA with a TFBS search method, we further show that it performs better than an alignment-free method accepting variable-length input TFBSs.

Studies often utilize TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. The analysis often involves TF model collection, promoter sequence retrieval and visualization, requiring several tools to accomplish. To accelerate TFBS analyses, we developed a novel integrated webtool, allowing users to perform the analysis without leaving the site. Important features include accepting unaligned variable-length TFBSs, collections of 1792 TF models, automatic promoter sequence retrieval, visualization in the UCSC Genome Browser [19] and gene regulatory network inference/visualization based on binding specificities. We describe this user-friendly webtool in detail in Chapter 2.

Most of the TFBS search methods consider only known binding sites of a TF and hence deal with one-class classification problems. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. In Chapter 3, we describe the NPV and ODV methods, which utilize both binding sites and non-binding sites of a TF. We show the performance gain of the NPV and ODV methods over the ULPB method through cross-validation experiments and independent validation experiments on the whole genome scale.

Chapter 1: LASAGNA: An Unsupervised Learning Algorithm for TFBS Alignment

1.1 Background

In this chapter, we describe a novel TFBS alignment algorithm named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association). The algorithm is based on the hypothesis that binding sites of a TF share a core [12], a short and highly conserved stretch of DNA. Hence, a binding site can be seen as a core with some irrelevant bases on one or both sides. In general, shorter sites tend to contain fewer irrelevant bases and are easier to align. For this reason, we progressively align the binding sites from the shortest to the longest ones. The algorithm further exploits dependence between two positions in a binding site. Dependence between positions has been shown to boost performance of TFBS search algorithms [63, 69] as well as protein structural motif recognition [44]. To the best of our knowledge, this idea has never been applied to multiple sequence alignment. We further describe a more sophisticated version, named LASAGNA-ChIP, for aligning peak sequences produced by ChIP-seq experiments.

To compare algorithms for TFBS alignment, we conduct cross-validation (CV) experiments on 4771 binding sites of 189 TFs across 5 species extracted from the TRANSFAC Public database (release 7.0). We compare LASAGNA to ClustalW2 [81, 45] and MEME [3]. ClustalW2 [81, 45] is a widely used multiple sequence alignment tool. It aligns sequences by first constructing a guide tree based on pairwise sequence alignments. The guide tree then determines the order in which the sequences are aligned. Although ClustalW2 was not specifically designed for aligning TFBSs, it was used to produce gapped TFBS alignments in creating the MAPPER database [58] as well as to produce both gapped and gapless TFBS alignments in [69]. ClustalW2 and other similar algorithms focus on producing structurally correct alignments, while other improved algorithms rely on structural or homology information [61]. ClustalW2 can be viewed as a representative of these algorithms when no information other than sequences is available.

MEME [3], on the other hand, is a widely adopted de novo motif discovery tool. It is used to find over-represented patterns or motifs in input sequences. It is capable of locating the motif instances in each input sequence, while it can also handle the case when an input sequence does not contain an instance of the motif. Since the binding sites of a TF share a common pattern or motif, a de novo motif discovery tool can be used to locate the motif instance in each binding site. This shared motif can then be used to align the binding sites. Therefore, in this dissertation, we view MEME as a TFBS alignment tool rather than a de novo motif discovery tool. In fact, MEME is employed by the PAZAR database [64] to dynamically align TFBSs and generate PSSMs. It is also used by 5 out of 6 tools compared in [80] for ChIP-seq data analysis. We show that LASAGNA significantly outperforms ClustalW2 (p-value: $1.22 \times 10^{-15}$) and MEME (p-value: $3.55 \times 10^{-15}$).

To scan promoters for new TFBSs based on variable-length known TFBSs, we couple a PSSM method with LASAGNA, denoted by LASAGNA-PSSM. That is, the input variable-length TFBSs are aligned by LASAGNA and a PSSM model is built from the alignment. It is useful to compare an alignment-based TFBS search method to an alignment-free method. Therefore, we further compare LASAGNA-PSSM to SiTaR [24], which accepts variable-length input TFBSs. To the best of our knowledge, SiTaR is the only alignment-free method capable of handling variable-length input TFBSs at the time of writing. Cross-validation results on 90 TFs whose binding sites can be located in respective genomes indicate that LASAGNA-PSSM is significantly more precise at fixed recall rates (p-value: $2.66 \times 10^{-8}$). The recall-precision curve also shows that our method is consistently more precise at any recall rate and more sensitive at any precision.

Finally, we demonstrate the application of LASAGNA-ChIP to ChIP-seq data using 38 mouse ChIP-seq experiments. We show that, assuming the one-per-sequence model, LASAGNA-ChIP is comparable to MEME in revealing the motif of the ChIPed TF or its cofactor. For both LASAGNA-ChIP and MEME, the ChIPed TF motif was found in 31 experiments, while a cofactor motif was found in 3 experiments. While the two methods differ in the remaining 4 experiments, the found motifs have similar information content and may belong to unknown cofactors.

1.2 Methods

We describe our novel alignment algorithm in this section. LASAGNA utilizes a search module to align a new binding site with a partial alignment. Thus, we introduce the search module followed by the LASAGNA algorithm.

1.2.1 The Search Module

The search module of LASAGNA is a function learned from a (partial) TFBS alignment to score l-mers. It considers nucleotide pairs in addition to individual nucleotides so as to exploit dependence between positions. We introduce our choice of the search module, the PSSM model described in [63]. We denote it by $\mathrm{PSSM}^a_K(\cdot)$ in this chapter.

Suppose that a PSSM is constructed from aligned sequences of length l. The score of letter u at position i is given by

$$M_i(u) = \log \frac{f_i(u)}{f(u)},$$

where $f_i(u)$ is the probability of observing letter u at position i and $f(u)$ is the background probability of seeing letter u. Similarly, the score of a pair of letters (u, v) at position (i, j) is given by

$$M_{i,j}(u, v) = \log \frac{f_{i,j}(u, v)}{f(u, v)},$$

where $f_{i,j}(u, v)$ is the probability of observing nucleotide pair (u, v) at position (i, j) and $f(u, v)$ is the background probability of seeing the pair. The score of s, a sequence of length l, is then

$$\mathrm{PSSM}_K(s) = \sum_{i=1}^{l} M_i(s_i) + \sum_{k=1}^{K} \sum_{i=1}^{l-k} M_{i,j}(s_i, s_j), \tag{1.2.1}$$

where $s_i$ denotes the $i$th letter of s, $j = i + k$ and K is the scope parameter defined in [63]. The parameter K controls how far apart a pair of nucleotides can be. When K = 1, only adjacent nucleotide pairs are scored. We define $\mathrm{PSSM}_0(s) = \sum_{i=1}^{l} M_i(s_i)$, that is, we do not score nucleotide pairs when K = 0.

Our search module is a variant of (1.2.1). Let

$$M'_i(u) = \begin{cases} \min_{x} M_i(x) & \text{if } u \text{ is the gap letter} \\ M_i(u) & \text{otherwise} \end{cases}$$

and

$$M'_{i,j}(u, v) = \begin{cases} \min_{x,y} M_{i,j}(x, y) & \text{if } u \text{ or } v \text{ is the gap letter} \\ M_{i,j}(u, v) & \text{otherwise.} \end{cases}$$

The search module is defined as follows:

$$\mathrm{PSSM}^a_K(s) = \sum_{i=1}^{l} M'_i(s_i) + \sum_{k=1}^{K} \sum_{i=1}^{l-k} M'_{i,j}(s_i, s_j), \tag{1.2.2}$$

where superscript a denotes alignment as this module is used in our alignment algorithm.
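The search module can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: probabilities are estimated with a pseudocount of 0.5, the backgrounds are taken to be uniform (1/4 for single letters, 1/16 for pairs), natural logarithms are used, and the names `build_model` and `pssm_a` are hypothetical. Gap letters receive the minimum score of the corresponding column or position pair, as in the definition above.

```python
import math
from itertools import product

GAP = "-"
NUC = "ACGT"

def build_model(aligned, K, pseudocount=0.5):
    """Estimate single and pair log-odds scores (assumed uniform backgrounds)."""
    l = len(aligned[0])
    single = []
    for i in range(l):
        col = [row[i] for row in aligned if row[i] != GAP]
        total = len(col) + 4 * pseudocount
        single.append({u: math.log((col.count(u) + pseudocount) / total / 0.25)
                       for u in NUC})
    pair = {}
    for k in range(1, K + 1):
        for i in range(l - k):
            j = i + k
            obs = [(row[i], row[j]) for row in aligned
                   if GAP not in (row[i], row[j])]
            total = len(obs) + 16 * pseudocount
            pair[(i, j)] = {uv: math.log((obs.count(uv) + pseudocount) / total * 16)
                            for uv in product(NUC, repeat=2)}
    return single, pair

def pssm_a(model, s, K):
    """Score an l-mer; a gap letter gets the minimum score at its position."""
    single, pair = model
    l = len(single)
    total = 0.0
    for i in range(l):
        col = single[i]
        total += min(col.values()) if s[i] == GAP else col[s[i]]
    for k in range(1, K + 1):
        for i in range(l - k):
            table = pair[(i, i + k)]
            u, v = s[i], s[i + k]
            total += min(table.values()) if GAP in (u, v) else table[(u, v)]
    return total

model = build_model(["ACGT", "ACGT", "ACGA", "TCGT"], K=1)
print(pssm_a(model, "ACGT", 1) > pssm_a(model, "AC-T", 1))  # gap is penalized
```

Setting K = 0 in `pssm_a` skips the pair terms entirely, which corresponds to scoring individual nucleotides only.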

1.2.2 The LASAGNA Algorithm

The algorithm is based on the idea that the binding sites of a TF share a common core, a conserved short DNA sequence. A binding site can then be seen as a core with a few irrelevant bases on one or both sides. Assuming that each binding site fully contains the core, the shorter a binding site, the fewer irrelevant bases it contains. Therefore, we progressively align the binding sites by aligning the shortest binding site with the already aligned binding sites until all the binding sites are aligned.

The algorithm takes a set of unaligned binding sites, U, and parameter $K_a$ as inputs. Let A denote the set of aligned binding sites. A binding site in A may have gap letters added to one or both ends as a result of the alignment. The algorithm works as follows:

(i) Initialize A to {s}, where s, the seed site, is a shortest binding site arbitrarily chosen from U. Remove s from U.

(ii) (a) Build $\mathrm{PSSM}^a_{K_a}(\cdot)$ from A. Let the length of this PSSM be l.

(b) Remove the shortest binding site s from U.

(d) Score each l-mer of S by $\mathrm{PSSM}^a_{K_a}(\cdot)$ to find the highest scoring one.

(e) Let s be its reverse-complement and repeat c–d. That is, the opposite strand is considered.

(iii) Add s to A if the highest scoring l-mer resides in s. Otherwise, add its reverse-complement to A. Gap letters are added to one or both ends of sequences in A. This ensures that they are all of the same length and each column of the alignment has at least one non-gap letter.

(iv) Repeat ii–iii until U is empty.
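The steps above can be sketched for the simplest case, $K_a = 0$. The following is a hedged illustration, not the reference implementation. Several details are assumptions: the text as given here does not define S, so the sketch takes S to be s padded with l − 1 gap letters on both ends; scores use a pseudocount of 0.5 and a uniform background; ties among shortest sites are broken by the highest score, with the first encountered site chosen when scores are equal; and all function names are hypothetical.

```python
import math

GAP = "-"
NUC = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(site):
    return site.translate(COMPLEMENT)[::-1]

def build_pssm(aligned, pseudocount=0.5):
    """Column log-odds scores; the gap letter gets the column minimum."""
    pssm = []
    for i in range(len(aligned[0])):
        letters = [row[i] for row in aligned if row[i] != GAP]
        total = len(letters) + pseudocount * len(NUC)
        column = {u: math.log((letters.count(u) + pseudocount) / total / 0.25)
                  for u in NUC}
        column[GAP] = min(column.values())
        pssm.append(column)
    return pssm

def best_window(site, pssm):
    """Return (score, alignment column of the site's first letter)."""
    l = len(pssm)
    padded = GAP * (l - 1) + site + GAP * (l - 1)  # assumed definition of S
    best = None
    for offset in range(len(padded) - l + 1):
        window = padded[offset:offset + l]
        score = sum(pssm[i][c] for i, c in enumerate(window))
        if best is None or score > best[0]:
            best = (score, l - 1 - offset)
    return best

def best_strand(site, pssm):
    """Consider both strands; return (score, shift, oriented site)."""
    fwd = best_window(site, pssm)
    rc = reverse_complement(site)
    rev = best_window(rc, pssm)
    return (fwd[0], fwd[1], site) if fwd[0] >= rev[0] else (rev[0], rev[1], rc)

def lasagna_k0(sites):
    unaligned = list(sites)
    seed = min(unaligned, key=len)           # step i: a shortest site as seed
    unaligned.remove(seed)
    aligned = [seed]
    while unaligned:                         # step iv
        pssm = build_pssm(aligned)           # step ii(a)
        shortest = min(len(s) for s in unaligned)
        scored = [(best_strand(s, pssm), s)
                  for s in unaligned if len(s) == shortest]
        (_, shift, oriented), chosen = max(scored, key=lambda t: t[0][0])
        unaligned.remove(chosen)             # step ii(b)
        width = len(aligned[0])
        left = max(0, -shift)
        right = max(0, shift + len(oriented) - width)
        # Step iii: pad existing rows, then add the new site at its column.
        aligned = [GAP * left + row + GAP * right for row in aligned]
        new_row = GAP * (shift + left) + oriented
        aligned.append(new_row + GAP * (len(aligned[0]) - len(new_row)))
    return aligned

result = lasagna_k0(["TTGCGCCAAA", "GCGCCAAA", "GCGCCAAATT", "TTTGGCGC"])
for row in result:
    print(row)
```

In this toy input the fourth site is the reverse complement of the core, so the sketch adds it to the alignment in the opposite orientation, and the shared core occupies the same columns in every row.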

In step iib, there may be more than one shortest binding site in U. To break the tie, we use $\mathrm{PSSM}^a_{K_a}(\cdot)$ to scan each of the shortest ones. The site s containing the highest scoring l-mer is removed from U to align with sequences in A. In the unlikely case of two or more shortest binding sites in U sharing the same highest score, one is arbitrarily chosen. Figure 1.2.1 illustrates an iteration of the algorithm.

[Figure 1.2.1: An Illustration of LASAGNA with Ka = 0. Panels (a) and (c): the aligned set A and the unaligned set U before and after one iteration; panel (b): PSSM.]

An alignment may be trimmed before building a PSSM. We describe one way of trimming aligned TF binding sites using two simple measures. Let l be the length of the aligned binding sites. We first compute and denote the percentage of non-gap letters at position i of the alignment by C_i, for i = 1, 2, ..., l. The information content (IC) at each position is then computed with small sample correction described in [73]. That is,

\[
IC_i = \max\Bigl\{0,\; 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u) - \hat{e}(n_i)\Bigr\},
\]

where i ∈ {1, 2, ..., l}, n_i is the number of non-gap letters at position i and ê(·) gives the approximated sampling error. Let C_min and IC_min be the cutoff thresholds. The alignment is examined from the left end to the right until the first position j satisfying both C_j > C_min and IC_j > IC_min is encountered. The positions preceding j are trimmed off. The trimming is similarly applied to the right end.
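A minimal sketch of this trimming rule is given below. It makes several assumptions that are not stated in the text: f_i(u) is taken to be the relative frequency of nucleotide u among the non-gap letters of column i, C_i is expressed as a fraction between 0 and 1, and the sampling error is approximated by 3 / (2 ln 2 · n_i), a commonly used small-sample approximation for a four-letter alphabet. The function names are hypothetical.

```python
import math

GAP = "-"


def column_stats(aligned):
    """Return a list of (C_i, IC_i) pairs, one per alignment column."""
    stats = []
    for col in zip(*aligned):
        letters = [b for b in col if b != GAP]
        n = len(letters)
        if n == 0:
            stats.append((0.0, 0.0))
            continue
        # Sum of f * log2(f) over the nucleotides present in the column.
        entropy_term = 0.0
        for u in "ACGT":
            f = letters.count(u) / n
            if f > 0:
                entropy_term += f * math.log2(f)
        # Assumed small-sample correction: (4 - 1) / (2 ln 2 * n).
        correction = 3 / (2 * math.log(2) * n)
        ic = max(0.0, 2 + entropy_term - correction)
        stats.append((n / len(col), ic))
    return stats


def trim(aligned, c_min, ic_min):
    """Trim both ends up to the first and last columns passing both cutoffs."""
    stats = column_stats(aligned)
    keep = [i for i, (c, ic) in enumerate(stats) if c > c_min and ic > ic_min]
    if not keep:
        return ["" for _ in aligned]
    left, right = keep[0], keep[-1]
    return [row[left:right + 1] for row in aligned]


if __name__ == "__main__":
    alignment = ["-GCGC-", "AGCGC-", "-GCGCT", "-GCGC-"]
    print(trim(alignment, c_min=0.4, ic_min=0.0))
```

Only the outermost columns are removed; interior columns that fail the cutoffs are kept, matching the left-to-right and right-to-left scans described above.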

1.2.3 LASAGNA for ChIP-seq Data

Although LASAGNA is not specifically designed as a de novo motif discovery algorithm, a more sophisticated version, named LASAGNA-ChIP, is capable of handling ChIP-seq data. Here, we refer to the previous section and describe the additional steps that are necessary for aligning ChIP-seq peak sequences. The flowchart in Figure 1.2.2 gives an overview of LASAGNA-ChIP.

Before aligning ChIP-seq peak sequences, each sequence is clipped to 100 bp surrounding the signal peak. This is a common practice since, for most peak sequences (> 90%), the actual TFBS is usually found within 50 bp of the called peak [38]. In step iia, we trim the partial alignment A if it contains more than two sequences. Unlike TFBSs found in databases such as TRANSFAC, even after clipping, a peak sequence contains many more irrelevant bases flanking the core. The trimming procedure described in the previous section is used, where C_min (IC_min) is set to the mean C_i (IC_i) over all the columns of A. The resulting alignment is further trimmed by IC such that it has at most 15 columns and the columns on both ends have positive IC. In step iib, if there are more than 5 shortest binding sites in U, only 5 are arbitrarily chosen to break the tie by similarity to PSSM^a_{K_a}(·).

Figure 1.2.2: LASAGNA-ChIP Flowchart

[Overview of the LASAGNA-ChIP algorithm: ChIP-seq peak sequences are clipped to 100 bp around the signal peak; six seed sites (the shortest site and arbitrary sites 1 to 5) are each used by LASAGNA with trimming to produce initial alignments 0 to 5; the refinement process yields refined alignments 0 to 5; the alignment with the highest IC is selected.]
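The clipping step can be sketched as below for a peak given as one line of a narrowPeak file, whose tenth column holds the summit offset from the peak start (or -1 when no summit was called). The function name, the fallback to the peak midpoint when the offset is -1, and the decision to keep the window inside the original peak boundaries are assumptions made for illustration.

```python
def clip_peak(narrowpeak_line, half_width=50):
    """Return (chrom, start, end) of the window around the signal peak."""
    fields = narrowpeak_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    summit_offset = int(fields[9])
    if summit_offset < 0:
        # Assumption: fall back to the midpoint if no summit is reported.
        summit_offset = (end - start) // 2
    summit = start + summit_offset
    # Assumption: the clipped window never extends beyond the called peak.
    return chrom, max(start, summit - half_width), min(end, summit + half_width)


if __name__ == "__main__":
    line = "chr1\t1000\t1400\t.\t0\t.\t12.5\t-1\t-1\t150"
    print(clip_peak(line))
```

With `half_width=50` the window is at most 100 bp long; peaks shorter than 100 bp are returned unchanged.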

The alignment A obtained by the modiﬁed procedure is further reﬁned as follows:

(i) Set T to A trimmed to l columns as described above.

(ii) Build PSSM^a_{K_a}(·) out of T.

(iii) Initialize R, the refined partial alignment, to {}.

(iv) For each peak sequence s,

(a) Create S, the augmented sequence of s, by adding l − 1 gap letters to both ends of s.

(b) Score each l-mer of S by PSSM^a_{K_a}(·) to find the highest scoring one.

(c) Let s be its reverse-complement and repeat a–b.

(d) Add s to R if the highest scoring l-mer resides in s. Otherwise, add its reverse-complement to R. Gap letters are added to one or both ends of sequences in R.

(v) Set A to R and repeat i–v until the sum of IC across columns of T does not change in 3 iterations.

For ChIP-seq peak sequences, the shortest sequence may miss or contain only a fraction of the core. Hence, using the shortest sequence as the seed site sometimes results in an alignment with less IC. For this reason, ﬁve additional sequences are arbitrarily chosen as the seed site to produce 5 additional alignments. Among the 6 alignments, the one with the most IC after trimming is chosen as the ﬁnal alignment.

1.2.4 Scoring a Putative Binding Site

Although a PSSM suggests the length of a putative binding site, we do not restrict the length of a candidate binding site to the length of the PSSM. A putative binding site could be of any reasonable length. If a true binding site is flanked by a few irrelevant bases, this sequence should be given a relatively high score compared to those of non-binding sites. Therefore, to score a putative binding site s, we slide s through the PSSM as described in the previous section. The score of sequence s is given by

\[
\mathrm{Score}_{K_s}(s) = \max_{i \in \{1, 2, \ldots, l + l_s - 1\}} \mathrm{PSSM}_{K_s}\bigl(S_{i:(i+l-1)}\bigr), \tag{1.2.1}
\]

where l is the length of the PSSM, l_s is the length of s, S denotes the augmented sequence of s with l − 1 gap letters on both ends and PSSM_{K_s}(·) is defined in (1.2.1).

1.3 Results and Discussion

1.3.1 Comparison of Alignment Algorithms

1.3.1.1 Data sets

We downloaded all the TF binding sites from the TRANSFAC Public database (release 7.0). The binding sites were grouped by species and TF. Binding sites having fewer than 4 nucleotides were discarded. TFs of each species were filtered such that each TF has at least 10 binding sites. This ensures that each TF has enough binding sites to construct a PSSM. The numbers of TFs and TFBSs are listed in Table 1.3.1.

Table 1.3.1: TFBSs in TRANSFAC Public Database by Species

Species                     # TFs¹   # TFBSs²
Homo sapiens                    68       1984
Mus musculus                    53        966
Rattus norvegicus               26        633
Drosophila melanogaster         29        935
Saccharomyces cerevisiae        13        253
Overall                        189       4771

¹ The total number of TFs. ² The total number of TFBSs.

To facilitate experiments, we planted each TFBS in a 2000 base random sequence simulated by a first-order Markov chain of the species in question. Except for Saccharomyces cerevisiae, the Markov chain of a species was learned from promoter sequences in the UCSC Genome Browser database [19]. For Saccharomyces cerevisiae, the promoter sequences were retrieved from the SCPD [93] using the yeast gene list in euGenes [30].
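The simulation and planting steps can be sketched as follows. The initial and transition probabilities shown are placeholders rather than values learned from promoter sequences, the function names are hypothetical, and planting is assumed to overwrite background bases so that the sequence length stays at 2000.

```python
import random

ALPHABET = "ACGT"


def simulate_background(length, initial, transition, rng):
    """Sample a sequence from a first-order Markov chain over A, C, G, T."""
    seq = rng.choices(ALPHABET, weights=[initial[b] for b in ALPHABET])
    while len(seq) < length:
        prev = seq[-1]
        seq += rng.choices(ALPHABET, weights=[transition[prev][b] for b in ALPHABET])
    return "".join(seq)


def plant(background, site, position):
    """Overwrite background bases with the site (assumption: length preserved)."""
    return background[:position] + site + background[position + len(site):]


if __name__ == "__main__":
    uniform = {b: 0.25 for b in ALPHABET}
    transition = {b: dict(uniform) for b in ALPHABET}  # placeholder parameters
    rng = random.Random(0)
    background = simulate_background(2000, uniform, transition, rng)
    planted = plant(background, "GCGCCAAA", 100)
    print(len(planted), planted[95:113])
```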

1.3.1.2 Performance assessment and evaluation metrics

Since the purpose of aligning TFBSs is to construct a PSSM, the quality of an alignment is best measured by the search performance of the PSSM. The performance of a TFBS search method is evaluated by ν-fold CV. Consider a TF with n binding sites. The n TFBSs are first divided into ν sets, each of which contains ⌊n/ν⌋ or ⌊n/ν⌋ + 1 TFBSs. At each iteration of the ν-fold CV, one of the ν TFBS sets, called the test TFBS set, P_test, is left out. The rest of the TFBSs are aligned to build a PSSM. Each test TFBS in P_test is then planted in a 2000 base random sequence and scanned by the PSSM, scoring each l-mer, where l is the length of the test TFBS. We score both the forward and reverse strands of an l-mer and assign the higher score to it. An l-mer is considered a hit if it shares more than ⌊l/2⌋ bases with the test TFBS. The l-mers can then be divided into two sets, H and N, where H is the set of hits and N is considered the set of non-binding sites. The score of the test TFBS is the highest score of hits in H. For each test TFBS t ∈ P_test, we find its rank relative to all the non-binding sites in N. Formally, the rank of binding site t equals 1 + |{s ∈ N | Score_{K_s}(s) ≥ Score_{K_s}(t)}|.

After the ν-fold CV, we end up with n ranks, each of which corresponds to a TFBS. We use the area under the ROC curve (AUC) to gauge the quality of alignment. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to [23] for an introduction to this metric. In this study, ν = 10 for all the CV experiments.
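The hit rule and the rank formula above can be written compactly as below. This is an illustrative sketch: the function names are hypothetical, l-mers are identified by their start positions in the 2000 base sequence, and scores are plain numbers supplied by whatever PSSM is being evaluated.

```python
def is_hit(lmer_start, site_start, l):
    """An l-mer is a hit if it shares more than floor(l/2) bases with the site."""
    overlap = max(0, l - abs(lmer_start - site_start))
    return overlap > l // 2


def rank_of_site(hit_scores, nonsite_scores):
    """Rank of the test TFBS: 1 + number of non-binding sites scoring >= it."""
    site_score = max(hit_scores)  # highest score among hits in H
    return 1 + sum(s >= site_score for s in nonsite_scores)


if __name__ == "__main__":
    print(is_hit(100, 103, 8), is_hit(100, 104, 8))
    print(rank_of_site([3.0, 7.5], [1.0, 7.5, 9.0, 2.0]))
```

A rank of 1 means that no non-binding l-mer scored as high as the planted site.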

1.3.1.3 Comparison with ClustalW2

In general, gapless alignment is preferred over gapped alignment for aligning TFBSs. Because of the nature of ClustalW2, the alignment of TFBSs may contain gaps in the middle of some binding sites. This is disadvantageous to ClustalW2 as the PSSM method does not allow insertion of gaps into the sequence being scanned. Hence, we turned off gaps by setting the gap opening penalty parameters to a large value, i.e., we set both GAPOPEN and PWGAPOPEN to 100000. Indeed, results indicated that overall the "gapped" ClustalW2 performs slightly worse than the "gapless" variant (p-value: 0.277). For both LASAGNA and ClustalW2, parameter Ks in Eq. 1.2.1 was searched from 0 to min{10, l_min − 1} for each TF and the one producing the highest AUC is used, where l_min is the minimal length of the TFBSs. For LASAGNA, parameter Ka of the LASAGNA algorithm was set to Ks as the two parameters are closely related.

We conducted 10-fold CV on each TF. The overall ROC curves are shown in Figure 1.3.1. The ROC curves are based on the ranks of 4771 TFBSs of 189 TFs. They show that LASAGNA has a consistently higher true positive rate than ClustalW2. The AUC score was calculated for each TF and for each method.

To gauge the significance of the difference, the Wilcoxon signed-rank test [87] was performed for each species and overall. Table 1.3.2 shows the test results. Overall, LASAGNA performed significantly better than ClustalW2 in terms of AUC scores. The species-wise p-values show that LASAGNA is also significantly better (p-value < 0.05) than ClustalW2 for aligning the TFBSs of each of the 5 species.

To better understand the results, we split the 189 TFs into two groups. One contains TFs on which LASAGNA performed better than ClustalW2 and the other contains the rest of the TFs. Three factors are examined for each TF. They are the number of TFBSs, the mean and standard deviation of TFBS length. For each factor, we looked for a difference between the two groups. Table 1.3.3 shows the comparisons. It can be seen that LASAGNA produces better alignments when a TF has fewer binding sites but the difference is not significant. The mean and standard deviation of TFBS length are the two more important factors. We believe that LASAGNA is well-suited for aligning TFBSs that are longer and more variable in length.

Figure 1.3.1: Overall ROC Curves for the Three Alignment Algorithms

[Two panels plotting true positive rate against false positive rate for LASAGNA, ClustalW2 and MEME; one panel covers false positive rates from 0.000 to 0.020 and the other from 0.0 to 0.6.]

1.3.1.4 Comparison with MEME

The MEME tool in the MEME Suite 4.8.1 was used. The parameter minw, minimal width of motifs, was set to the smaller of 6 and the minimal length of input TFBSs. The option revcomp to search the reverse strand was turned on. Finally, the parameter minsites was set to the number of input TFBSs since a common motif is supposed to appear at least once in each TFBS. To ensure that MEME functions properly, binding sites shorter than 8 bases are padded with gap letters since genomic locations are not available for most TFBSs.

Table 1.3.2: Species-wise and Overall Comparisons between LASAGNA and ClustalW2

Species            # better¹      # ties²   # TFs³   p-value⁴
H. sapiens         54 (79.4%)     0         68       4.42 × 10^-7
M. musculus        42 (79.2%)     0         53       1.41 × 10^-5
D. melanogaster    22 (75.9%)     0         29       9.89 × 10^-4
S. cerevisiae      9 (69.2%)      1         13       3.88 × 10^-2
R. norvegicus      20 (76.9%)     1         26       1.54 × 10^-3
Overall            147 (77.8%)    2         189      1.22 × 10^-15

¹ Number of TFs on which LASAGNA performs better than ClustalW2. ² Number of TFs on which LASAGNA and ClustalW2 have the same performance. ³ Total number of TFs for a species. ⁴ Wilcoxon signed-rank test p-value.

The experiments were carried out in the same manner as the ClustalW2 experiments. The overall ROC curve in Figure 1.3.1 indicates that LASAGNA has consistently higher true positive rates than MEME across different false positive rates. The overall and species-wise comparisons between LASAGNA and MEME in Table 1.3.4 show that LASAGNA performed significantly better than MEME. To gain some insights into the difference between LASAGNA and MEME, we similarly examined the three factors used to compare LASAGNA and ClustalW2. As seen in Table 1.3.5, the number of input TFBSs is the only significant (p-value < 0.05) factor out of the three. The reasons are not clear but may be investigated in the future. Moreover, it will be helpful to identify other (biologically meaningful) factors that can better explain the performance difference.

Table 1.3.3: Comparison of Two Groups of TFs Divided According to Results on LASAGNA and ClustalW2

Factor                   Group 1¹ mean   Group 2² mean   p-value³
# TFBSs⁴                 25.07483        25.83333        0.1409
Mean of TFBS length      18.78626        17.56167        0.08451
SD of TFBS length⁵       8.180204        6.921905        0.06295

¹ LASAGNA performed better than ClustalW2 on TFs in this group. ² ClustalW2 performed better than or equal to LASAGNA on TFs in this group. ³ Wilcoxon signed-rank test p-value. ⁴ Number of binding sites for each TF. ⁵ Standard deviation of binding site length for each TF.

1.3.1.5 Distribution of Ks

In Figures 1.3.2, 1.3.3 and 1.3.4, for LASAGNA, ClustalW2 and MEME, we show the distribution of Ks for a TF by species and conserved domain. Overall, we observe that small values are preferred for all three methods. By visual inspection, LASAGNA appears more similar to MEME than ClustalW2 in the usage of Ks. It can be seen that the usage of Ks differs among different conserved domains. Related conserved domains such as ZF-H2C2_2 and ZF-C2H2 display similar patterns. This is not surprising as conserved domains in a protein are often computationally predicted. Hence, a protein is likely to possess related conserved domains. While overall the distributions seem method-dependent, we observe that, for ZF-H2C2_2 and ZF-C2H2, the distributions center around 4 across all three methods. Finally, we note that these observations are preliminary and more TFs are needed to draw statistically sound conclusions.

Table 1.3.4: Species-wise and Overall Comparisons between LASAGNA and MEME

Species            # better¹      # ties²   # TFs³   p-value⁴
H. sapiens         41 (60.3%)     0         68       7.83 × 10^-3
M. musculus        41 (77.4%)     0         53       8.79 × 10^-6
D. melanogaster    26 (89.7%)     0         29       1.02 × 10^-7
S. cerevisiae      10 (76.9%)     3         13       2.96 × 10^-3
R. norvegicus      23 (88.5%)     1         26       1.73 × 10^-4
Overall            141 (74.6%)    4         189      3.55 × 10^-15

¹ Number of TFs on which LASAGNA performs better than MEME. ² Number of TFs on which LASAGNA and MEME have the same performance. ³ Total number of TFs for a species. ⁴ Wilcoxon signed-rank test p-value.

1.3.2 Comparison of TFBS Search Methods

1.3.2.1 Data sets

To compare with an alignment-free TFBS search method, SiTaR [24], we retrieved real promoter sequences embedding TFBSs. Specifically, we followed the curated location of each binding site in the TRANSFAC Public database (release 7.0) to retrieve the 1000-base sequences flanking the binding site. We discarded binding sites that cannot be found in the proximity of the curated locations. The retrieved binding sites were grouped by TF and TFs having fewer than 10 binding sites were removed. After filtering, we ended up with 90 TFs and 1751 binding sites. A TF may be present in more than one species as homologs and hence the binding sites of a TF may be located in genomes of multiple species. The species and respective numbers of binding sites are shown in Table 1.3.6.

Table 1.3.5: Comparison of Two Groups of TFs Divided According to Results on LASAGNA and MEME

Factor                   Group 1¹ mean   Group 2² mean   p-value³
# TFBSs⁴                 23.33333        30.85417        0.03196
Mean of TFBS length      18.33468        19.04125        0.3007
SD of TFBS length⁵       7.95844         7.730625        0.1846

¹ LASAGNA performed better than MEME on TFs in this group. ² MEME performed better than or equal to LASAGNA on TFs in this group. ³ Wilcoxon signed-rank test p-value. ⁴ Number of binding sites for each TF. ⁵ Standard deviation of binding site length for each TF.

1.3.2.2 Performance assessment and evaluation metrics

To compare with SiTaR [24], we adopt the same ν-fold CV process used to compare LASAGNA with ClustalW2 and MEME. However, we do not assume that a TFBS search method scores all the l-mers in a promoter sequence, where l is the length of binding sites. Instead, a TFBS search method scans a promoter sequence and predicts a list of binding sites with respective scores. The predicted binding sites may be of different lengths, which is the case for SiTaR.

Table 1.3.6: Distribution of the 1751 Binding Sites of 90 TFs in TRANSFAC Public Database

Species                     # TFBSs¹
Homo sapiens                     735
Mus musculus                     346
Rattus norvegicus                278
Saccharomyces cerevisiae         158
Drosophila melanogaster          155
Gallus gallus                     73
Bos taurus                         5
Sus scrofa                         1

¹ Total number of TFBSs.

We describe how a hit is determined. Let the length of a predicted binding site be l and the length of the test TFBS in question be l_s. The predicted binding site is considered a hit to the test TFBS if the overlap between the two sequences is more than ⌊l_s/2⌋ bases as in [24]. In case this is not possible, i.e., l ≤ ⌊l_s/2⌋, the predicted binding site must be embedded in the true one to be deemed a hit.
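A small sketch of this hit rule for predictions of arbitrary length is shown below. The function name and the use of half-open start/length coordinates are assumptions for illustration.

```python
def is_hit_variable_length(pred_start, pred_len, true_start, true_len):
    """Apply the overlap rule for a predicted site against a test TFBS."""
    pred_end, true_end = pred_start + pred_len, true_start + true_len
    overlap = max(0, min(pred_end, true_end) - max(pred_start, true_start))
    threshold = true_len // 2
    if pred_len <= threshold:
        # Too short to overlap by more than the threshold: must be embedded.
        return overlap == pred_len
    return overlap > threshold


if __name__ == "__main__":
    print(is_hit_variable_length(104, 10, 100, 10))  # overlap of 6 bases
    print(is_hit_variable_length(102, 4, 100, 10))   # short but embedded
```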

Using the n ranks of TFBSs from ν-fold CV, we compute recall (true positive rate), precision and the F_β-measure, where β = 0.5 as in [24]. Let the recall rate be r. The number of TFBSs recalled by the method is p_T = n × r. Let the number of non-binding sites or false positives introduced be p_F. The precision is given by p_T / (p_T + p_F), while

\[
F_\beta = (1 + \beta^2) \times \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}.
\]
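As a worked illustration of these definitions, the following sketch computes the three quantities from counts. The function name and the convention of returning 0 when a denominator is zero are assumptions; the counts in the usage example are made up.

```python
def precision_recall_fbeta(n_sites, n_recalled, n_false_positives, beta=0.5):
    """Return (precision, recall, F_beta) from raw counts."""
    recall = n_recalled / n_sites
    predicted = n_recalled + n_false_positives
    precision = n_recalled / predicted if predicted else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, f_beta


if __name__ == "__main__":
    # 10 TFBSs, 5 recalled, 15 false positives (hypothetical counts).
    print(precision_recall_fbeta(10, 5, 15))
```

With β = 0.5, precision is weighted more heavily than recall, so the example above (precision 0.25, recall 0.5) gives an F_β of 5/18, closer to the precision.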

1.3.2.3 Comparison with an alignment-free method

We conducted 10-fold CV on the aforementioned 90 TFs. The PSSM method dependent on LASAGNA (LASAGNA-PSSM) was compared to SiTaR [24]. LASAGNA considered both strands of a sequence when aligning binding sites. The parameters Ka = Ks were determined in the same way as in comparing LASAGNA to ClustalW2. An alignment was trimmed with C_min = 0.4 and IC_min = 0 before constructing a PSSM as described in the method section on the LASAGNA algorithm. The PSSM method uses a cutoff score to predict TFBSs. The cutoff score is set to the minimal score of the constituting binding sites of the PSSM. The SiTaR method has a mismatch parameter and the maximal value allowed by its webtool is 5. We searched in the range from 0 to 5 to find the mismatch value giving the highest F_β-measure for each TF.

In terms of the F_β-measure, no significant difference was found between the two methods (p-value: 0.392 [87]). To ensure a fair comparison, we fixed the recall rate for each TF and compared the precision achieved by LASAGNA-PSSM and SiTaR. The recall rate was set to the lower of the recall rates attained by LASAGNA-PSSM and SiTaR. The TF c-Jun (AC: T00132) was excluded from comparison because SiTaR did not recover any TFBS. Figure 1.3.5a shows the plot of precision by LASAGNA-PSSM against that by SiTaR. At fixed recall rates, LASAGNA-PSSM is more precise than SiTaR on 65 out of 89 TFs (p-value: 2.66 × 10^-8). Figure 1.3.5b shows the plots of precision against recall based on all the recalled TFBSs by each method. It can be seen that LASAGNA-PSSM is consistently more precise than SiTaR at the same recall rate. Moreover, LASAGNA-PSSM recovered substantially more TFBSs than SiTaR at the same precision.

Results reported in [24] showed that SiTaR is highly precise and sensitive. Although SiTaR accepts variable-length binding sites, all the experiments presented in [24] used fixed-length binding sites as inputs. It is therefore not clear how SiTaR performs on TFs having variable-length binding sites. It is also not clear whether SiTaR preprocesses highly variable-length binding sites as this was not stated in [24]. These issues however are not the focus of this dissertation.

1.3.3 Application of LASAGNA-ChIP to ChIP-seq Data

To demonstrate the use of LASAGNA-ChIP on ChIP-seq data, we retrieved mouse ChIP-seq data produced by the Encyclopedia of DNA Elements (ENCODE) project [14] from the UCSC Genome Browser [19]. All the 38 peak files in the Narrow Peaks format that match the pattern ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/wgEncode*Tfbs*Pk* were downloaded on Oct. 12, 2012, where "*" is the wildcard character matching zero or more characters. These files give the signal peak location besides the start and end of each peak and hence the corresponding signal files do not need to be processed by a peak-finding algorithm. Four distinct cell types and 17 distinct target TFs are present in the 38 ChIP-seq experiments. Table 1.3.7 lists, for each ChIP-seq experiment, the cell, target TF, number of peaks as well as the minimum, maximum, mean and standard deviation of peak length. We observe that the peak length varies greatly. The mean peak length can be as long as 1124, while the highest standard deviation is nearly 876.

Figure 1.3.2: Distribution of Ks by Species and Conserved Domain (LASAGNA)

[Histograms of Count against K (0 to 10). Panels: Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.]

Figure 1.3.3: Distribution of Ks by Species and Conserved Domain (ClustalW2)

[Histograms of Count against K (0 to 10). Panels: Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.]

Figure 1.3.4: Distribution of Ks by Species and Conserved Domain (MEME)

[Histograms of Count against K (0 to 10). Panels: Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.]

Figure 1.3.5: Comparison of the PSSM Method Dependent on LASAGNA to SiTaR

[Panels: (a) LASAGNA-PSSM precision against SiTaR precision, one point per TF; (b) precision against the number of recalled TFBSs for LASAGNA-PSSM and SiTaR.]

Table 1.3.7: List of 38 ChIP-seq Experiments

Track                                                      Cell    TF       # Peaks  Min    Max    Mean    Std
wgEncodeSydhTfbsCh12Bhlhe40nb100IggrabPk                   CH12    BHLHE40    50576    1  10148   628.9  630.3
wgEncodeSydhTfbsMelBhlhe40cIggrabPk                        MEL     BHLHE40    11919    1   5609   581.5  584.7
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1    C2C12   CEBPB       7936  100   1397   304.5  101.3
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1          C2C12   CEBPB      11434  104   1455   349.7  125.3
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2    C2C12   CEBPB       7339  119   1390   351.1  101.6
wgEncodeSydhTfbsMelEts1IggrabPk                            MEL     ETS1       14368    1   9720   818.5  643.4
wgEncodeSydhTfbsCh12Ets1IggrabPk                           CH12    ETS1       41149    1  10484   676.1  652.2
wgEncodeCaltechTfbsC2c12Fosl1sc605FCntrl36bPcr1xPkRep1     C2C12   FOSL1       6507  125   1035   346.5  105.5
wgEncodeSydhTfbsMelGata1Dm2p5dStdPk                        MEL     GATA1      23793    1   4461   461.4  251.0
wgEncodeSydhTfbsMelGata1IggratPk                           MEL     GATA1      60511    1   5608   473.6  362.4
wgEncodeSydhTfbsCh12CjunIggrabPk                           CH12    JUN        16619    1   6938   563.9  396.6
wgEncodeSydhTfbsMelJundIggrabPk                            MEL     JUND        1638    2   4946   563.1  409.0
wgEncodeSydhTfbsCh12JundIggrabPk                           CH12    JUND       14821    1   5473   585.3  463.2
wgEncodeSydhTfbsCh12Mafkab50322IggrabPk                    CH12    MAFK       38914    1   8918   423.0  434.7
wgEncodeSydhTfbsMelMafkab50322IggrabPk                     MEL     MAFK       11369    1   6308   386.3  415.9
wgEncodeSydhTfbsMelMafkDm2p5dStdPk                         MEL     MAFK        1191   28   9421   466.1  370.1
wgEncodeSydhTfbsEse14MafkStdPk                             ES-E14  MAFK       14597   16   6152   390.3  217.0
wgEncodeSydhTfbsMelMaxIggrabPk                             MEL     MAX        10079    1   7565   899.0  816.9
wgEncodeSydhTfbsCh12MaxIggrabPk                            CH12    MAX        31133    1   9066   741.0  693.3
wgEncodeCaltechTfbsC2c12MaxFCntrl50bPcr1xPkRep1            C2C12   MAX         8930  103   2242   674.4  262.9
wgEncodeCaltechTfbsC2c12MaxFCntrl50bE2p60hPcr1xPkRep1      C2C12   MAX         2085  109   1390   464.3  149.4
wgEncodeSydhTfbsMelCmybh141IggrabPk                        MEL     MYB         4019    1   6956   751.8  618.8
wgEncodeSydhTfbsMelCmybsc7874IggrabPk                      MEL     MYB         4019    1   6956   751.8  618.8
wgEncodeSydhTfbsCh12CmycIggrabPk                           CH12    MYC        19356    1   9735   896.7  734.7
wgEncodeSydhTfbsMelCmycIggrabPk                            MEL     MYC         6297    1  10242  1124.5  876.5
wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1   C2C12   MYOD1       3247  103   1019   380.6  115.4
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1  C2C12   MYOD1      20160   87   2338   386.3  159.5
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1        C2C12   MYOD1      13040   92   1997   474.0  199.9
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1  C2C12   MYOD1      29226   67   5176   504.8  265.0
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1  C2C12   MYOG       32173   66   3463   504.0  257.4
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1  C2C12   MYOG       24360   65   1968   339.8  156.1
wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1   C2C12   MYOG        6885  118   1515   423.8  124.9
wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1      C2C12   SRF         2610  177   2672   549.4  237.5
wgEncodeSydhTfbsMelTbpIggmusPk                             MEL     TBP        25442    1  10597   795.6  705.6
wgEncodeSydhTfbsCh12TbpIggmusPk                            CH12    TBP        17492    1   9544   779.4  567.9
wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1      C2C12   TCF3       10172   65   2026   324.4  136.7
wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1     C2C12   USF1        6742  100   1165   359.5  120.0
wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1           C2C12   USF1        8579  108   1893   415.2  128.2

It is useful to know if LASAGNA-ChIP is able to align peak sequences and reveal the motif of the ChIPed TF. To align peak sequences, parameter Ka was searched from 0 to 8 to obtain the alignment with the highest IC. MEME was also used to align peak sequences because it is often the method of choice. In fact, MEME is used by 5 out of 6 tools compared in [80] for ChIP-seq data analysis. The MEME parameters are described in the section Comparison of Alignment Algorithms, where the one-per-sequence model is assumed. To ensure that both methods finish within reasonable time, for each experiment, we randomly sampled 300 peaks for alignment. We did not distinguish large peaks from small ones because ChIP-seq experiments require large numbers of cells and hence "a small peak could represent very strong binding in only a subset of the cells" [22].

For each alignment, we searched for the resulting motif in 386 UniPROBE mouse motifs and 398 motifs derived from all the matrices in the TRANSFAC Public database. The search was accomplished by the software TOMTOM [33]. We used Pearson correlation as the distance measure, required a minimal overlap of 5 nucleotides, and set the E-value cut-off to 5. Table 1.3.8 shows, for each ChIP-seq experiment, the sequence logos of motifs found by LASAGNA-ChIP and MEME. The matching motifs found by TOMTOM are listed under each sequence logo [15] by E-value. In case more than 10 significant motifs were found, only the 10 most significant ones were shown. For each ChIP-seq experiment, the one matching the ChIPed TF is highlighted in yellow, while the one matching a cofactor of the ChIPed TF is highlighted in blue if the ChIPed TF is not found.

We first notice that overall the motifs found by LASAGNA-ChIP and MEME are very similar by visual inspection. No significant difference is observed in terms of motif IC (p-value: 0.1252). For both LASAGNA-ChIP and MEME, the ChIPed TF motifs were found for 31 experiments. The other 7 experiments are one MYB in MEL cells, all the ETS1 in CH12 and MEL cells, one JUND in MEL cells, one MAX in C2C12 cells and all the TBP in CH12 and MEL cells. Interestingly, LASAGNA-ChIP and MEME differ only for 4 out of these 7 experiments. They are one ETS1 in CH12 cells, one MAX in C2C12 cells and two TBP in CH12 and MEL cells. Although LASAGNA-ChIP and MEME differ in these cases, the found motifs still warrant further analyses. For instance, the motif for ETS1 in CH12 cells found by LASAGNA-ChIP resembles the secondary motif of Gabpa (highlighted in green), which is a known paralog.

For the other 3 out of the 7 experiments, LASAGNA-ChIP and MEME produced similar motifs. The one found for MYB in MEL resembles those of GATA proteins. This agrees with a recent study reporting that MYB and GATA-3 cooperatively regulate IL-13 by direct binding to a conserved GATA-3 response element [42]. Since this motif is based on 300 peak sequences, it is likely that the two proteins similarly regulate other genes in MEL cells. The motif for ETS1 in MEL cells also matches those of GATA proteins. Cooperation between ETS1 and GATA-3 in regulating IL-5 was also suggested [9, 86]. Finally, while the motif for JUND in MEL cells matches two motifs in the TRANSFAC and UniPROBE databases, the matches are likely false positives since no literature support was found.

While it is not specifically designed to be a de novo motif discovery method, LASAGNA-ChIP aligns all the peak sequences and finds the most informative motif. The assumption that a motif instance is present in each peak sequence may not hold for some experiments. Because of several possible binding models [22], two or more motifs may be present in subsets of the peak sequences. Discovery of more than one motif will be enabled for LASAGNA-ChIP in the near future.

Table 1.3.8: Sequence Logos of Motifs Found by LASAGNA-ChIP and MEME

[The sequence logos (WebLogo 3.2) and the per-motif lists of matching TRANSFAC and UniPROBE motifs are not reproduced here. The information content (IC, in bits) of each motif is listed below.]

TF       Cell    LASAGNA-ChIP  MEME   Dataset
BHLHE40  MEL     7.14          7.21   wgEncodeSydhTfbsMelBhlhe40cIggrabPk
BHLHE40  CH12    8.32          8.19   wgEncodeSydhTfbsCh12Bhlhe40nb100IggrabPk
CEBPB    C2C12   10.41         10.4   wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1
CEBPB    C2C12   9.56          9.79   wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2
CEBPB    C2C12   10.39         10.49  wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1
JUN      CH12    8.0           7.91   wgEncodeSydhTfbsCh12CjunIggrabPk
MYB      MEL     8.2           8.21   wgEncodeSydhTfbsMelCmybh141IggrabPk
MYB      MEL     8.17          8.23   wgEncodeSydhTfbsMelCmybsc7874IggrabPk
MYC      CH12    7.96          7.98   wgEncodeSydhTfbsCh12CmycIggrabPk
MYC      MEL     8.44          8.45   wgEncodeSydhTfbsMelCmycIggrabPk
ETS1     CH12    7.58          7.99   wgEncodeSydhTfbsCh12Ets1IggrabPk
ETS1     MEL     8.63          8.71   wgEncodeSydhTfbsMelEts1IggrabPk
FOSL1    C2C12   11.59         11.19  wgEncodeCaltechTfbsC2c12Fosl1sc605FCntrl36bPcr1xPkRep1
GATA1    MEL     9.22          9.2    wgEncodeSydhTfbsMelGata1Dm2p5dStdPk
GATA1    MEL     9.52          9.5    wgEncodeSydhTfbsMelGata1IggratPk
JUND     CH12    7.97          8.37   wgEncodeSydhTfbsCh12JundIggrabPk
JUND     MEL     16.59         14.47  wgEncodeSydhTfbsMelJundIggrabPk
MAFK     ES-E14  13.47         13.47  wgEncodeSydhTfbsEse14MafkStdPk
MAFK     MEL     13.14         13.01  wgEncodeSydhTfbsMelMafkDm2p5dStdPk
MAFK     CH12    8.76          8.7    wgEncodeSydhTfbsCh12Mafkab50322IggrabPk
MAFK     MEL     8.65          8.43   wgEncodeSydhTfbsMelMafkab50322IggrabPk
MAX      C2C12   7.76          7.96   wgEncodeCaltechTfbsC2c12MaxFCntrl50bE2p60hPcr1xPkRep1
MAX      C2C12   7.67          7.91   wgEncodeCaltechTfbsC2c12MaxFCntrl50bPcr1xPkRep1
MAX      CH12    7.93          7.79   wgEncodeSydhTfbsCh12MaxIggrabPk
MAX      MEL     8.44          8.4    wgEncodeSydhTfbsMelMaxIggrabPk
MYOG     C2C12   9.93          9.96   wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1
MYOG     C2C12   10.49         10.44  wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1
MYOG     C2C12   12.31         12.26  wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1
MYOD1    C2C12   12.35         12.39  wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1
MYOD1    C2C12   11.58         11.29  wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1
MYOD1    C2C12   10.82         10.75  wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1
MYOD1    C2C12   13.04         12.89  wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1
SRF      C2C12   8.84          8.84   wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1
TBP      CH12    7.74          6.81   wgEncodeSydhTfbsCh12TbpIggmusPk
TBP      MEL     8.38          8.06   wgEncodeSydhTfbsMelTbpIggmusPk
TCF3     C2C12   11.32         11.02  wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1
USF1     C2C12   12.16         12.0   wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1
USF1     C2C12   12.4          12.34  wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1
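The IC values reported in Table 1.3.8 summarize how conserved each motif is. As a minimal sketch, assuming the standard definition for DNA (2 bits minus the Shannon entropy of each column, summed over columns, with no small-sample correction), the quantity can be computed from a gapless alignment as follows. The function name and the absence of pseudocounts are illustrative assumptions; the exact procedure used to produce the table may differ.

```python
import math

def information_content(aligned_sites):
    """Total information content (bits) of a gapless DNA alignment.

    Assumes every site has the same length and contains only A, C, G, T.
    Each column contributes 2 - H, where H is its Shannon entropy in bits.
    """
    length = len(aligned_sites[0])
    if any(len(s) != length for s in aligned_sites):
        raise ValueError("sites must have equal length")
    total = 0.0
    for i in range(length):
        column = [s[i] for s in aligned_sites]
        col_ic = 2.0  # maximum for a 4-letter alphabet
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                col_ic += p * math.log2(p)  # adds -H term by term
        total += col_ic
    return total
```

A perfectly conserved column contributes 2 bits and a uniformly random column contributes 0 bits, so longer and more conserved motifs have higher IC.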

1.3.4 LASAGNA is Simple and Effective

Unlike MEME and similar methods, LASAGNA and ClustalW depend crucially on the order in which the input sequences are aligned. ClustalW relies on a guide tree based on pairwise alignments to decide the order. LASAGNA, on the other hand, depends on the length of a sequence and its similarity to the partial alignment. LASAGNA-ChIP is well-suited for a TF whose shortest site misses the core or contains only a fraction of it. We, however, observed no significant difference between LASAGNA and LASAGNA-ChIP on TFBSs in the TRANSFAC Public database. This is because, for these TFBSs, a shortest site often fully contains the core. Hence, our assumption holds true in general.

For ChIP-seq data, the assumption that short sequences contain fewer irrelevant bases flanking the core may not hold. However, we observe that, under the one-per-sequence model, LASAGNA-ChIP performed comparably to MEME in aligning ChIP-seq peak sequences. We attempted other orders, such as from the longest sequence to the shortest one, and found that aligning the shortest sequence first does have its advantage (data not shown). Also, we note that, for 11 out of 38 experiments, the peak sequences are all at least 100 bp (see Table 1.3.7) and hence all the peak sequences are 100 bp long after clipping. This implies that LASAGNA-ChIP is capable of handling sequences of the same length.

LASAGNA-ChIP, MEME and similar methods produce gapless alignments and do have their limits. When a TF binds to two cores separated by a variable-length spacer, these methods are expected to align the canonical TFBSs containing spacers of the most prevalent length. These binding patterns are also known as two-block motifs. Gapped alignment or explicit modeling [8] is needed to correctly align TFBSs of this nature.

1.4 Conclusions

We proposed LASAGNA, a novel alignment algorithm specifically designed for aligning variable-length transcription factor binding sites. Cross-validation results on 189 TFs and 4771 TFBSs indicated that LASAGNA significantly outperformed ClustalW2 (p-value: $1.22 \times 10^{-15}$) and MEME (p-value: $3.55 \times 10^{-15}$). This is because LASAGNA was specifically designed for aligning variable-length TFBSs. Based on the success of LASAGNA, we developed LASAGNA-ChIP, which is capable of handling sequences produced by ChIP-chip and ChIP-seq experiments. While ClustalW2 is better suited for producing structurally correct alignments, LASAGNA-ChIP, MEME and similar methods can be used to align sequences produced by ChIP-chip or ChIP-seq experiments.

We compared LASAGNA-PSSM, the PSSM method dependent on LASAGNA, to SiTaR, an alignment-free TFBS search method. Cross-validation experiments were conducted on 1751 TFBSs of 90 TFs for both methods. The results showed that, at fixed recall rates, LASAGNA-PSSM is significantly more precise than SiTaR (p-value: $2.66 \times 10^{-8}$). The recall-precision curve showed that our method is consistently more precise at any recall rate or more sensitive at any precision.

We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. In the future, more sources of experimentally validated TFBSs such as the PAZAR database will be incorporated into the webtool, making variable-length TFBSs more accessible to scientists in the field.

Chapter 2 LASAGNA-Search: A User-friendly Webtool for Transcription Factor Binding Site Search and Visualization

2.1 Background

In this chapter, we describe a user-friendly webtool named LASAGNA-Search for transcription factor binding site search. We use the term position-specific weight matrix (PWM) to refer to a $4 \times l$ matrix described in the introduction chapter, where $l$ is the length of the binding sites of the TF of interest. The term position-specific scoring matrix (PSSM) is hence used to refer to the method that scores binding sites based on a PWM or an alignment.
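To make the PWM/PSSM distinction concrete, the following is a minimal sketch that builds a 4 × l matrix of log-odds weights from gapless aligned sites and then scores an l-mer with it. The pseudocount, the uniform background and the function names are assumptions made for illustration; they are not taken from the formulation in the introduction chapter.

```python
import math

ALPHABET = "ACGT"

def build_pwm(aligned_sites, pseudocount=1.0, background=None):
    """Return a 4 x l matrix as a dict: base -> list of l log-odds weights.

    Assumption: weights are log2(frequency / background) with an additive
    pseudocount; other weighting schemes are equally possible.
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    length = len(aligned_sites[0])
    if any(len(s) != length for s in aligned_sites):
        raise ValueError("sites must be aligned to the same length")
    total = len(aligned_sites) + pseudocount * len(ALPHABET)
    pwm = {b: [0.0] * length for b in ALPHABET}
    for i in range(length):
        for b in ALPHABET:
            count = sum(1 for s in aligned_sites if s[i] == b)
            freq = (count + pseudocount) / total
            pwm[b][i] = math.log2(freq / background[b])
    return pwm

def pssm_score(pwm, lmer):
    """Score an l-mer as the sum of the position-specific weights."""
    return sum(pwm[base][i] for i, base in enumerate(lmer))
```

Here the PWM is the data structure, while the PSSM is the scoring procedure applied to it.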

A typical TFBS search webtool takes a PWM and a promoter sequence as inputs and outputs putative binding sites. Many webtools implement useful features in addition to this basic function. Some tools accept variable-length binding sites [69, 24, 3] instead of a PWM. Some tools offer precomputed models built from PWMs or TFBSs so users do not need to collect PWMs or TFBSs to use a tool [67, 10, 13, 82, 78, 31]. Some tools adopt a TFBS search method that exploits position dependence [69, 67]. Some tools offer promoter sequence retrieval or integrate a sequence retrieval tool [67, 83, 2, 82, 78]. For result visualization, the MAPPER2 database [67] supports visualizing hits in the UCSC Genome Browser [19] for three species. It is also desirable to visualize the predicted binding specificities as a gene regulatory network (GRN) [31]. However, not all of these useful features are available in a single webtool.

To incorporate all the aforementioned useful features, we implemented LASAGNA-Search: a webtool for TFBS search and visualization. LASAGNA-Search accepts variable-length TFBSs in addition to PWMs. It offers 1792 precomputed models based on TFBSs and PWMs collected from the TRANSFAC Public, JASPAR, ORegAnno and UniPROBE databases. Its search module exploits position dependence for a TFBS-based model whenever performance gain is indicated by cross-validation. Automatic promoter sequence retrieval is supported for 15 species at LASAGNA-Search, which enables visualization of search results in the UCSC Genome Browser for the 15 species. Search results can also be visualized along promoter sequences locally at LASAGNA-Search for any species. Finally, a GRN can be constructed from search results and visualized locally with various options.

2.2 Materials and Methods

[Figure 2.2.1 diagram. Model input: user TFBSs (alignment module, trim alignment, build model), user PWM, TFBS-based collections (TRANSFAC, ORegAnno, PAZAR) and PWM-based collections (TRANSFAC, JASPAR CORE, UniPROBE). Promoter input: gene ID/symbol or mRNA accession (promoter retrieval module), or user promoter. Search module: search results, previous search results, infer gene regulatory network. Visualization: HTML table, plain-text table, local image, UCSC Genome Browser, gene regulatory network.]

Figure 2.2.1: Architecture of LASAGNA-Search

The search module of LASAGNA-Search takes TF models and promoter sequences as inputs. TFBS-based collections contain precomputed TF models from TFBSs, while PWM-based collections include precomputed TF models from PWMs. The alignment module aligns user-provided (variable-length) TFBSs and the alignment may be manually trimmed before model building. Users may also input a PWM for model building. Promoter sequences may be provided by users or automatically retrieved by the promoter retrieval module using the NCBI Gene ID, the official symbol or an mRNA accession number of a gene. Results produced by the search module may be displayed in an HTML or tab-delimited table. LASAGNA-Search offers visualization of the results as local images, while the results can also be displayed in the UCSC Genome Browser as custom tracks. A gene regulatory network can be inferred from search results and visualized locally.

Figure 2.2.1 shows the architecture of LASAGNA-Search. We introduce the major components in the following sections.

2.2.1 Modules

2.2.1.1 Alignment Module

The alignment module aligns variable-length TFBSs so a model can be built from the alignment. It implements the LASAGNA algorithm detailed in Chapter 1, which has been extensively compared to ClustalW2 [45] and MEME [3] with favorable outcomes (see Section 1.3.1).

2.2.1.2 Search Module

The search module takes a TF model and a promoter sequence as inputs. The TF model specifies $l$, the length of a putative binding site, and the parameter $K$, and gives $M_i(u)$ and $M_{ij}(u, v)$ for $u \in \{A, C, G, T\}$, $i \in [1, l - k]$ and $j \in [i + 1, i + K]$, as seen in (1.2.1). Depending on the TF, scores of nucleotide pairs may contribute to the score of a sequence. This is controlled by the parameter $K \geq 0$, the maximal distance between the two nucleotides of a pair. The value of $K$ is TF-dependent and is determined by cross-validation. Hence, $K$ is greater than 0 only if nucleotide pairs improve the search performance for a TF.
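The role of K can be illustrated with a short sketch. It assumes that the score is the sum of the single-position terms plus the pair terms for all position pairs at most K apart; equation (1.2.1) is not reproduced here, so the exact form, the 0-based indexing and the dictionary layout of the model below are illustrative assumptions only.

```python
def pssm_k_score(site, single, pair, K):
    """Score a site with single-position and nucleotide-pair terms.

    single: list of dicts, single[i][u] is the weight of base u at position i.
    pair:   dict keyed by (i, j) with i < j <= i + K; each value is a dict
            mapping (u, v) to the weight of bases u, v at positions i, j.
    K:      maximal distance between paired positions; K = 0 disables pairs.
    Positions are 0-based here (the text uses 1-based indices).
    """
    l = len(site)
    score = sum(single[i][site[i]] for i in range(l))
    for i in range(l):
        # j runs over i+1 .. i+K, clipped to the end of the site
        for j in range(i + 1, min(i + K, l - 1) + 1):
            score += pair.get((i, j), {}).get((site[i], site[j]), 0.0)
    return score
```

With K = 0 the inner loop is empty and the score reduces to the ordinary PSSM sum, which is why K > 0 is only worth using when cross-validation shows a gain.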

Conventionally, it is assumed that the first letter of an l-mer is aligned with the first position of a TF model and the l-mer is scored accordingly. Unlike the conventional approach, we align an l-mer with a TF model by sliding the l-mer and its reverse-complement through the model such that the overlap between the two is at least one nucleotide, as described in Section 1.2.4. Using the framework described in Section 2.3.3, we found that this is significantly better than the conventional approach for locating TFBSs (see Figure 2.2.2).

Moreover, this approach allows easy scoring of an l-mer by a cluster of TF models of different widths. Scoring by a cluster of TF models has been shown to outperform using only the best model in the cluster [62] and hence is a feature to be included in the near future.
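The sliding strategy can be sketched as follows for a model without pair terms (K = 0). The sketch tries every offset at which the l-mer, or its reverse-complement, overlaps the model by at least one nucleotide and keeps the best-scoring placement. Treating positions outside the overlap as contributing nothing is an assumption made here for simplicity; the precise treatment is given in Section 1.2.4 and may differ. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def best_sliding_score(lmer, model):
    """Slide an l-mer and its reverse-complement through a model.

    model: list of dicts, model[p][u] is the weight of base u at position p.
    Returns (score, offset, strand) of the best placement, where sequence
    index k is aligned with model position k + offset.
    Assumption: only overlapping positions contribute to the score.
    """
    m = len(model)
    best = None
    for strand, seq in (("+", lmer), ("-", reverse_complement(lmer))):
        n = len(seq)
        # offsets from -(n-1) to m-1 guarantee an overlap of >= 1 nucleotide
        for offset in range(-(n - 1), m):
            score = 0.0
            for p in range(max(0, offset), min(m, offset + n)):
                score += model[p][seq[p - offset]]
            if best is None or score > best[0]:
                best = (score, offset, strand)
    return best
```

Because the sequence and the model need not have the same width, the same routine can score an l-mer against models of different widths, which is the property the text relies on for scoring by a cluster of TF models.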

[Figure 2.2.2 scatter plots. Panel A: Average Precision (70 TFs, p-value: 1.5e-05). Panel B: Accuracy (70 TFs, p-value: 4.1e-06). In both panels the x axis is "Slide" and the y axis is "Not slide".]

Figure 2.2.2: Comparison of Scoring Strategies Using TF Models Collected by LASAGNA-Search

The search module of LASAGNA-Search (x axis) scores a binding site by sliding a putative site through a TF model, while the conventional approach (y axis) does not. The evaluation framework is described in section “Evaluation of precomputed TF models” of the main article. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. The average performance across the models is used to plot the point. Average precision is used as the performance measure to generate (A), whereas accuracy is the performance measure for (B).

For each putative binding site or hit, the search module computes the score and the p-value, i.e., the probability of observing a score greater than or equal to that score under a background distribution. We adopt the 0th-order Markov model, also known as the independent multinomial model, as the background model. To estimate the background distribution for a PSSMK model, we consider only the binding sites or PWM used to build the model. We adopt this conservative way of estimating the score distribution because it is harder for a non-binding site to get a low p-value in this distribution.

For K = 0, the exact score distribution can be efficiently computed by convolution [36]. However, this is not the case for K > 0 since the score can no longer be seen as a sum of independent variables. Consequently, we compute p-values using empirical score distributions. The empirical score distribution of a model is obtained by scanning random promoter sequences simulated by the background model. Specifically, we focus on only those PSSMK scores in the upper 5%, and hence scores outside the upper 5% are assigned a p-value of 0.05+. The smallest non-zero p-value is $2.5 \times 10^{-5}$ and a p-value of 0 implies any number lower than $2.5 \times 10^{-5}$.
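For the K = 0 case, the convolution idea can be illustrated with a short sketch. It assumes that column scores have been discretized to integers so that equal totals merge, that the model is stored as one dict per column, and that the background is an independent multinomial; the names `score_distribution` and `p_value` are hypothetical, and the procedure of [36] may differ in its details.

```python
from collections import defaultdict


def score_distribution(int_model, background):
    """Exact distribution of the total score of a random l-mer.

    int_model  : list of dicts (one per column) mapping nucleotide -> integer score.
    background : dict mapping nucleotide -> probability (0th-order Markov model).
    Columns are independent, so the distribution of the sum is the
    convolution of the per-column score distributions.
    """
    dist = {0: 1.0}
    for column in int_model:
        convolved = defaultdict(float)
        for total, prob in dist.items():
            for base, score in column.items():
                convolved[total + score] += prob * background[base]
        dist = dict(convolved)
    return dist


def p_value(score, dist):
    """Probability of observing a score greater than or equal to `score`."""
    return sum(prob for s, prob in dist.items() if s >= score)


# Toy example: 3 columns, +1 for the consensus base and -1 otherwise.
MODEL = [
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 1, "T": -1},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
DIST = score_distribution(MODEL, UNIFORM)
print(p_value(3, DIST))  # 0.015625, i.e. 0.25 ** 3
```

The number of distinct totals grows with the score range rather than with the 4^l possible l-mers, which is why the exact computation is efficient. The same shortcut is unavailable for K > 0 because pair scores share nucleotides and are no longer independent.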

While the p-values are not corrected for multiple testing, they are useful for ordering hits found by different TF models. To take into account the length of the promoter sequence in which a hit is found, an E-value is computed for the hit. An E-value gives the expected number of times a hit of the same or higher score is found in the promoter sequence by chance. Let L be the length of the promoter sequence and l be the length of the putative binding site. Then $\text{E-value} = \text{p-value} \times (L - l + 1)$, which is approximately $\text{p-value} \times L$ when $L \gg l$.
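As a worked example of the E-value formula, the sketch below uses illustrative inputs: a p-value of 0.000125 for a 6-base site in a 1000-base promoter. The function name `e_value` is hypothetical.

```python
def e_value(p_value, promoter_length, site_length):
    """Expected number of chance hits: p-value times the number of start positions."""
    positions = promoter_length - site_length + 1
    if positions < 1:
        raise ValueError("the site is longer than the promoter sequence")
    return p_value * positions


# 1000 - 6 + 1 = 995 possible start positions.
print(round(e_value(0.000125, 1000, 6), 3))  # 0.124
```

With these inputs the approximation p-value × L gives 0.125, which differs from the exact value only in the third decimal place.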

2.2.1.3 Promoter Retrieval Module

Currently, LASAGNA-Search supports retrieving promoter sequences for 15 species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Saccharomyces cerevisiae, Caenorhabditis elegans, Caenorhabditis briggsae, Bos taurus, Sus scrofa, Ovis aries, Gallus gallus, Canis lupus familiaris, Felis catus, Xenopus (Silurana) tropicalis and Danio rerio. Users may enter the NCBI Gene identifier (ID), the official gene symbol or an mRNA accession number of a gene to retrieve its upstream promoter region. The upstream region of a gene is specified by positions relative to the transcription start site (TSS) obtained from the UCSC Genome Browser [19]. Information in the NCBI Gene database is used for conversion between Gene IDs and symbols.

2.2.1.4 Gene Regulatory Network Inference

LASAGNA-Search automatically constructs a gene regulatory network based on search results. A directed edge from a TF model to a gene is established if at least one significant hit is found in the promoter region of the gene by the TF model. The lowest p-value of these hits is used to compute the weight on this edge. That is, the thickness of the edge is proportional to $-\log(\text{p-value})$. In case the coding genes of a TF model are known, these genes may be added to the network with dotted arrows from the genes to the TF model. To simplify the network, the node for a TF model may be removed, leaving only its coding genes in the network. Figure 2.2.3 shows an example network of human genes TP53 and MYB. Visualization of gene regulatory networks at LASAGNA-Search is enabled by Cytoscape Web [57]. We describe how the networks in Figure 2.2.3 were generated in Section 2.3.1.

Figure 2.2.3: An Inferred Gene Regulatory Network of Human Genes TP53 and MYB

A small gene regulatory network of human genes TP53 and MYB inferred from scanning the promoters (950 bp upstream to 50 bp downstream) using two TP53 TF models and two MYB TF models. Genes are denoted by green ellipses and TF models are represented by red octagons. (A) The inferred network containing the 2 genes and the 4 TF models. (B) The dotted edges from the 2 coding genes to the 4 TF models are established in this network. (C) The simplified network after removing 4 nodes. These nodes are removed because the two TP53 TF models are coded by the TP53 gene and likewise for MYB.
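The edge construction described in this section can be sketched as below. Several details are assumptions made only for the illustration: hits are given as (TF model, gene, p-value) tuples, the logarithm is taken to base 10, and p-values reported as 0 are clamped to a floor of 2.5e-5 so that the weight stays finite.

```python
import math

P_FLOOR = 2.5e-5  # assumption: stand-in for p-values reported as 0


def infer_edges(hits, p_threshold=0.001):
    """Map (tf_model, gene) -> edge weight from an iterable of hits.

    An edge exists if at least one hit meets the threshold; its weight is
    -log10 of the lowest p-value among those hits.
    """
    lowest = {}
    for model, gene, p in hits:
        if p <= p_threshold:
            key = (model, gene)
            lowest[key] = min(p, lowest.get(key, p))
    return {key: -math.log10(max(p, P_FLOOR)) for key, p in lowest.items()}


# Hypothetical hits: two for one model/gene pair, one that is not significant.
HITS = [("M1", "MYB", 0.001), ("M1", "MYB", 0.0001), ("M2", "TP53", 0.01)]
EDGES = infer_edges(HITS)
print(EDGES)
```

Only the pair ("M1", "MYB") receives an edge here, with a weight of about 4, because the hit for "M2" does not meet the threshold.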

2.2.2 TF Model Collections

LASAGNA-Search currently offers 6 precomputed TF model collections. The collections are categorized by the type of data used to build a model. Table 2.2.9 lists the type and number of models for each collection. To facilitate gene regulatory network visualization, we mapped TF models to genes coding for the TFs. The number of models that can be mapped is also listed in Table 2.2.9 for each collection. Models in the TFBS-based collections were built from unaligned TFBSs, while models in the PWM-based collections were built from PWMs. We describe the two categories in the following sections.

Table 2.2.9: Summary of TF Model Collections

| Database    | Type | Models | Mapped Models¹ |
|-------------|------|--------|----------------|
| TRANSFAC    | TFBS | 189    | 188            |
| ORegAnno    | TFBS | 133    | 132            |
| PAZAR       | TFBS | 66     | 66             |
| TRANSFAC    | PWM  | 398    | 366            |
| JASPAR CORE | PWM  | 476    | 457            |
| UniPROBE    | PWM  | 530    | 524            |
| Total       |      | 1792   | 1733           |

¹ Models of TFs whose coding genes were found.

2.2.2.1 TFBS-based Collections

We collected experimentally validated transcription factor binding sites from the TRANSFAC Public database and the ORegAnno database. In these two collections, binding sites of a TF were not collected across species. TF models are non-redundant in the sense that a TF of a species has only one model based on all the available binding sites in a database. The binding sites of a TF were aligned by the alignment module to build a model. We built one model for each TF because, for most TFs, the binding affinity can be explained by only one model [92]. In case a TF recognizes more than one motif [25], we rely on database curators to distinguish binding sites belonging to distinct motifs. Moreover, the TFBS-based collections are complemented by our PWM-based collections, which offer more than one model for some TFs.

Binding sites of 5 species were collected from the TRANSFAC Public database (release 7.0) [59], including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Saccharomyces cerevisiae. For each species, a TF was included in our collection if it has at least 10 binding sites. In total, binding sites for 189 TFs across 5 species were collected. Although TRANSFAC does build PWMs for TFs, 72 (38.1%) of these 189 TFs do not have any PWMs in TRANSFAC.

Besides the 5 species present in the TRANSFAC collection, binding sites of Caenorhabditis elegans and Caenorhabditis briggsae were collected from the ORegAnno database (08Nov10 dump) [32]. Being an open-annotation database, ORegAnno allows users to adopt the role of curators and contribute binding sites and other types of annotations to the database. A nice feature is allowing users to enter an NCBI or Ensembl ID for each gene or transcription factor mention. This feature allows easy mapping of distinct mentions of the same TF to a unique database ID, so binding sites of a TF contributed by different users can be easily merged. Nevertheless, many TF mentions in ORegAnno are not accompanied by a database ID. In this case, we automatically assign the NCBI Gene ID to a TF mention by consulting the NCBI Gene database. We note that this is not always possible since a TF mention may be the symbol of one gene and a synonym of another and hence cannot be uniquely mapped. Ambiguity was manually resolved when a TF mention could not be uniquely mapped to an NCBI Gene ID. Still, some TF mentions are protein complexes and hence cannot be identified by a single gene ID. These mentions were semi-automatically collapsed. Finally, binding sites of 133 TFs across 7 species were collected, where each TF has at least 10 binding sites.

The PAZAR database [64] offers a platform for users to start curation projects. A record stores one annotation for one sequence from either a TF-gene interaction or a gene expression experiment. Hence, binding sites of a TF can be extracted from TF-gene interaction records in the PAZAR projects. Since more than one project may curate binding sites of a particular TF, we aggregated records containing TF-gene interaction information from all the public projects. All the files in the general feature format dated 20120117 were downloaded. We grouped TFBSs by TF and species, that is, human TF A and mouse TF A are considered two TFs.

A binding site was filtered out if it is shorter than 4 or longer than 1000 bases. To verify a binding site, we searched for it in the vicinity of the curated genomic location in the reference genome. The binding site was discarded if it could not be located within 5 bases of the curated location. As we collected binding sites for a TF across all the public projects in PAZAR, a TFBS may be curated by more than one project, resulting in multiple copies of the TFBS in our collection. Therefore, for each pair of overlapping binding sites, we kept only the shorter one if the overlap is more than 80% of the shorter one in length. A model was built for a TF if it has at least 10 binding sites.
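The length filter and the removal of redundant overlapping sites can be sketched as follows. The sketch assumes sites are (chromosome, start, end) tuples with half-open coordinates, ignores strand, and applies the pairwise rule greedily from the shortest site to the longest; the original pipeline may have resolved pairs in a different order.

```python
def overlap_length(a, b):
    """Number of bases shared by two (chrom, start, end) half-open intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def filter_sites(sites, min_len=4, max_len=1000, min_fraction=0.8):
    """Drop sites of unusual length, then drop the longer site of each redundant pair."""
    sized = [s for s in sites if min_len <= s[2] - s[1] <= max_len]
    kept = []
    # Visit shorter sites first so every kept site is the shorter member of a pair.
    for site in sorted(sized, key=lambda s: s[2] - s[1]):
        redundant = any(
            overlap_length(site, shorter) > min_fraction * (shorter[2] - shorter[1])
            for shorter in kept
        )
        if not redundant:
            kept.append(site)
    return kept


# Hypothetical sites: the 20-base site covers the whole 10-base site and is dropped.
SITES = [("chr1", 100, 120), ("chr1", 102, 112), ("chr1", 300, 302), ("chr1", 110, 125)]
print(filter_sites(SITES))
```

In this toy input the 2-base site fails the length filter, and the 15-base site is kept because it shares only 2 bases with the 10-base site.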

The LASAGNA-ChIP algorithm [50] was used to align the binding sites of a TF since some of the projects contain TFBSs identified by ChIP-seq and ChIP-chip experiments. As reported in [38], about 94% of the actual binding sites can be located within 50 bases of signal peaks. However, no clipping was done for sequences produced by ChIP-seq experiments since information about the signal peak is not available in PAZAR. The new collection contains 66 TF models, 39, 20 and 7 of which are human, mouse and rat, respectively.

As seen in Table 2.2.9, nearly all the TF models in the two collections were mapped to TF coding genes. Only one model in each collection remains unmapped due to lack of information in the source databases. They are ETF (T00270) and MYF in TRANSFAC and ORegAnno, respectively.

2.2.2.2 PWM-based Collections

In addition to binding sites, we also collected position-specific weight matrices (PWMs) from the TRANSFAC Public database, the JASPAR CORE database [10] and the UniPROBE database [60]. A PWM is a $4 \times l$ matrix, where l is the length of binding sites. Each element in column i of a PWM is usually the count or probability of a nucleotide at position i. PWMs are valuable resources for various reasons. One reason is that most PWMs in TRANSFAC and JASPAR were built by domain experts. For instance, some PWMs in TRANSFAC and JASPAR CORE were based on binding sites of more than one species because of cross-species conservation (e.g. TRANSFAC matrix M00152). Moreover, a PWM in TRANSFAC may be based on binding sites of more than one TF because of similar binding specificities (e.g. TRANSFAC matrix M00158). Another reason is that for some techniques no binding sites but only matrices are produced. The UniPROBE database, for example, stores data from protein binding microarray (PBM) experiments [6]. The PBM technique assigns a binding specificity score to each 10-mer sequence variant. Berger and Bulyk [6], however, do not suggest setting a specificity cut-off threshold to report binding sites. Instead, PWMs are produced by the Seed-and-Wobble algorithm.
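To make the count and probability forms of a PWM concrete, here is a small sketch that builds both from aligned, equal-length sites. The list-of-columns layout (the transpose of the 4 × l matrix), the function names and the optional pseudocount are assumptions for illustration; the source databases each use their own file formats.

```python
def count_matrix(aligned_sites):
    """Count nucleotides per position; returns one dict per column."""
    length = len(aligned_sites[0])
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for site in aligned_sites:
        if len(site) != length:
            raise ValueError("aligned sites must have equal length")
        for i, base in enumerate(site.upper()):
            counts[i][base] += 1
    return counts


def to_probabilities(counts, pseudocount=0.0):
    """Normalize each column of counts so that its entries sum to 1."""
    probabilities = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        probabilities.append(
            {base: (count + pseudocount) / total for base, count in column.items()}
        )
    return probabilities


# Four hypothetical aligned sites of length 3.
COUNTS = count_matrix(["ACG", "ACT", "AAG", "TCG"])
PROBS = to_probabilities(COUNTS)
print(COUNTS[0])  # {'A': 3, 'C': 0, 'G': 0, 'T': 1}
print(PROBS[0]["A"])  # 0.75
```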

From the UniPROBE database, we collected 530 PWMs of 6 species: Homo sapiens, Mus musculus, Saccharomyces cerevisiae, Caenorhabditis elegans, Plasmodium falciparum and Cryptosporidium parvum. These 530 PWMs correspond to 414 non-redundant TFs (proteins or protein complexes). We collected 476 PWMs from the JASPAR CORE database, where the PWMs were categorized into 6 species groups: Vertebrates, Insects, Plants, Fungi, Nematodes and Urochordates. Finally, 398 PWMs were collected from the TRANSFAC Public database and grouped into Vertebrates, Insects, Plants, Fungi, Nematodes and Bacteria.

According to Table 2.2.9, the PWM-based collections contain more unmapped TF models than the TFBS-based collections. Lack of information in the source databases is the major reason. Some matrices such as MA0102.1 and MA0061.1 in the JASPAR CORE database were built from TFBSs of more than one species but accession numbers of the homologous proteins are not available. Some matrices in the TRANSFAC and JASPAR CORE databases have protein accession numbers available, while records of the corresponding coding genes cannot be found in the NCBI Gene database. These proteins often belong to species such as Pisum sativum and Triticum aestivum, which are not as well-studied as model organisms.

2.3 Results and Discussion

In this section, we introduce the user interface, followed by a comparison of features to existing webtools, and evaluation of precomputed TF models in LASAGNA-Search and MAPPER2. Finally, we discuss future directions for improving LASAGNA-Search.

2.3.1 User Interface

2.3.1.1 Input Page

The LASAGNA-Search input page is divided into three parts, one for TF model input, one for promoter sequence input and one for result ﬁltering parameter input. Figure 2.3.1a shows a screenshot of the input page. Two options are available for result ﬁltering. One is setting a p-value threshold so that only hits with equal or lower p-values will be reported. The other is setting k so that only k hits with the highest scores will be reported.

For TF model input, LASAGNA-Search accepts variable-length TFBSs for model building. Users may input TFBSs in the FASTA format. The TFBSs will be aligned on clicking the “Start Searching” button. The PWM and sequence logo [15] of the automatically trimmed alignment will be displayed. Users may choose to further trim the alignment or recover previously trimmed columns. Figure 2.3.1c shows the user interface for TFBS alignment trimming. In addition to TFBSs, users may input a PWM for model building. LASAGNA-Search recognizes formats used by JASPAR, TRANSFAC and UniPROBE.

LASAGNA-Search currently offers two ways of selecting models in the TFBS-based and PWM-based collections. One is to browse each model collection, while the other is to search by keywords for models in all the collections. To browse a collection, users may click the radio button for the collection to browse models by species or species group. A model can then be added to the “shopping cart” by marking the model with a tick. Removing a tick mark will remove the corresponding model from the “shopping cart”. To search for models, users may enter one or more keywords and click the “Search” button. The models found will be displayed in a list and can be similarly selected or removed (see Figure 2.3.1b for an example). The number of selected models is displayed on the input page. Users may click the “Show” button to view these models and remove the unwanted ones.

For promoter sequence input, users may input their own promoter sequences in the FASTA format. Alternatively, users may retrieve promoter sequences by NCBI Gene IDs, gene symbols, or mRNA accession numbers. By clicking the “Search” button, LASAGNA-Search will display the matching promoters. Figure 2.3.1d shows the promoters found by keywords CCND1 and MYB. Users may choose to examine only promoters of a particular species. Only the matching human promoters are listed in Figure 2.3.1d after applying the filter. Promoters are selected in a manner similar to selecting TF models. Finally, users may also select from a list of randomly sampled promoters of a chosen species.

2.3.1.2 Result Page

The result page is organized into 5 tabs. The first tab displays hits on all the promoter sequences, whereas the second tab displays hits pertaining to one promoter sequence at a time. The third tab shows the gene regulatory network inferred from search results. The fourth tab allows importing previous search results to be merged with the current search results. The last tab contains the inputs, including the selected TF models, the selected promoters and the search parameters. Figure 2.3.2 shows an example result page with the second tab, named “Promoter view”, showing.

Only hits meeting the specified criterion are reported in the first and second tabs. For each hit, the model name, sequence, 0-based position, strand, score, p-value and E-value are reported. Hits found in the same promoter sequence can be sorted by model name, sequence, position, strand, p-value and E-value by clicking the respective column header. By default, the hits are displayed in an HTML table. Users may click a button on the result page to obtain the table in the tab-delimited format. Previous search results in the tab-delimited format can be easily imported to the current search results. This is particularly useful when additional TFs of interest to the user are identified after an initial search.

Users may display the hits along the promoter sequence, where the $-\log(\text{p-value})$ of each hit is used as the height to plot a box. This allows easy visualization of the predicted binding sites by a model in the context of those by the other models. Finally, the hits can be saved in GFF (general feature format) or the bedGraph format for visualization in the UCSC Genome Browser [19]. Links are provided for each promoter sequence to automatically create a custom track and redirect users to the UCSC Genome Browser. Figure 2.3.3 shows a custom track of putative binding sites predicted by LASAGNA-Search in the context of 4 other relevant tracks.
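The conversion of a hit from its 0-based promoter position to genome coordinates can be sketched as below. The sketch covers only promoters on the plus strand and assumes the promoter's first base is given as a 1-based genomic coordinate; the `source` and feature labels, the function names and the example values are hypothetical. GFF uses 1-based inclusive coordinates, whereas bedGraph uses 0-based half-open coordinates.

```python
def hit_to_gff(chrom, promoter_start, hit_pos, hit_len, strand, score, name,
               source="LASAGNA-Search", feature="TFBS"):
    """Format one hit as a 9-column, tab-separated GFF line (1-based, inclusive)."""
    start = promoter_start + hit_pos
    end = start + hit_len - 1
    fields = [chrom, source, feature, str(start), str(end),
              f"{score:g}", strand, ".", name]
    return "\t".join(fields)


def hit_to_bedgraph(chrom, promoter_start, hit_pos, hit_len, value):
    """Format one hit as a bedGraph line (0-based start, exclusive end)."""
    start0 = promoter_start - 1 + hit_pos
    return f"{chrom}\t{start0}\t{start0 + hit_len}\t{value:g}"


# Illustrative hit: 6 bases at 0-based position 125 of a promoter starting at 135501503.
print(hit_to_gff("chr6", 135501503, 125, 6, "+", 7.2, "AML-1a(M00271)"))
print(hit_to_bedgraph("chr6", 135501503, 125, 6, 3.9))
```

Both lines describe the same six bases: 135501628 to 135501633 in GFF and 135501627 to 135501633 in bedGraph.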

The automatically inferred GRN can be displayed and manipulated by clicking the tab named “Gene regulatory network”. To produce a sparser network, users may set a more stringent p-value than the one used to filter hits. Users may show only nodes belonging to one or more species listed under “Filter by species”. Figure 2.2.3a shows the network after restricting the species to Homo sapiens. Users may choose to display the TF coding genes by checking “Map TFs to coding genes”. Figure 2.2.3b shows the resulting network. While 6 nodes are present in the GRN in Figure 2.2.3b, there are essentially only 2 genes and their products in the network. When a GRN involves more genes, it may be desirable to simplify the GRN, replacing the TF models with their respective coding genes. Figure 2.2.3c displays the simplified two-node GRN generated by checking “Simple network”. We note that a GRN can be simplified only after the TF models are mapped to coding genes.

2.3.2 Comparison of Features to Existing Webtools

LASAGNA-Search was designed to allow users to scan promoters for TFBSs without leaving the LASAGNA-Search page. Many features of LASAGNA-Search were developed for user convenience. Hence, without the knowledge of PWM or TFBS databases and promoter sequence retrieval tools, users can start searching for binding sites in a promoter sequence and visualize the hits in the UCSC Genome Browser immediately. There are several integrative TFBS search webtools available. It is useful to compare LASAGNA-Search with the existing webtools to better understand the advantages and disadvantages of LASAGNA-Search and suggest future work to improve LASAGNA-Search. Table 2.3.10 summarizes the comparison of LASAGNA-Search to matrix-scan and the search engine of the MAPPER2 database for TFBS search.

In terms of input TF models, LASAGNA-Search and MAPPER2 have large libraries of TF models, while users need to collect PWMs before using matrix-scan. Users may input a PWM or unaligned TFBSs to LASAGNA-Search for model building. On the other hand, while matrix-scan accepts PWMs, neither matrix-scan nor MAPPER2 accepts unaligned TFBSs. For promoter sequences, all three tools accept sequences in FASTA, while matrix-scan handles sequences in 5 additional formats. Automatic sequence retrieval for matrix-scan is accomplished by interfacing with two tools, “retrieve sequence” and “retrieve EnsEMBL sequence”, on the same website. These two tools are capable of retrieving sequences in a wide range of species and can be used with any TFBS search tools. LASAGNA-Search and MAPPER2 offer integrated promoter retrieval tools supporting 15 and 3 species, respectively.

Visualization of predicted binding sites is usually tightly connected with the promoter sequence retrieval used by a tool. This is because to create a custom track in the UCSC Genome Browser, the genome build (release version) and coordinates in the genome must be known for a promoter sequence. For LASAGNA-Search and MAPPER2, hits found on any promoter sequences retrieved by the provided tool can be visualized with ease in the UCSC Genome Browser. Therefore, 15 and 3 species are supported by LASAGNA-Search and MAPPER2, respectively. Visualizing hits found by matrix-scan in the UCSC Genome Browser is possible only when the genome build and coordinates are specified in the FASTA header of the promoter sequence. Headers of sequences retrieved by the aforementioned two tools, however, do not contain the required information enabling visualization of hits in the UCSC Genome Browser.

Gene regulatory network inference from search results is only available at LASAGNA-Search among the three integrative webtools. PAINT [31] offers a similar function by integrating Match [40] in the TRANSFAC Public or Professional databases and promoter sequence retrieval for human, mouse and rat. Compared to PAINT, LASAGNA-Search contains 1792 TF models from 5 source databases and retrieves promoters for 15 species. The major difference between LASAGNA-Search and PAINT, however, is that LASAGNA-Search keeps track of the coding genes of TF models. This is an important feature because it allows visualization of self-regulation as self-loops and merging nodes for TF models coded by the same genes.

Finally, it is useful to compare LASAGNA-Search to other relevant webtools. The MEME Suite [2] offers web interfaces to 4 TFBS search tools with access to whole-genome promoter sequences. However, these tools have no access to the PWM database in the suite, nor do they scan promoters of specific genes and offer visualization of hits. Two tools motivated by evolutionary conservation are COTRASIF [82] and ReXSpecies2 [78]. COTRASIF collects 138 JASPAR and 398 TRANSFAC PWMs and offers whole-genome Ensembl promoter sequences. However, it does not allow selection of gene-specific promoter sequences, nor does it offer visualization. ReXSpecies2, on the other hand, sources PWMs from JASPAR, scans promoters of specific genes and allows visualization in the UCSC Genome Browser. However, it focuses only on human and mouse, and selecting individual PWMs requires use of regular expression-like patterns.

2.3.3 Evaluation of Precomputed TF Models

Since MAPPER2 is the most comparable webtool to LASAGNA-Search, it is useful to compare the TF model collections offered by LASAGNA-Search and MAPPER2 on a whole-genome basis. As the MAPPER2 database stores, for each TF model, hits from scanning the 10Kbp upstream region of each transcript, we scanned the same sequences using TF models offered by LASAGNA-Search. We have no access to the profile hidden Markov models [20] used by MAPPER2 and the dynamic scanning interface offered by MAPPER2 was not functioning at the time of writing. Fortunately, MAPPER2 allows downloading of the top-1000 hits for each model. We hence limited the comparison to the top-1000 hits produced by each TF model.

To evaluate model performance, human and mouse ChIP-seq data from the ENCODE project [68] was used as the gold-standard. Hence, we considered models for human and mouse TFs. The comparison was performed on a per-TF basis and all the TFs that can be validated were included. Tables 2.3.11 and 2.3.12 list the ChIP-seq tracks (experiments) by TF for human and mouse, respectively. We associated each TF with models that can be used to predict its binding sites. Each of the 1000 hits produced by a model was checked against the ChIP-seq peaks of the TF. A hit is marked a true positive if it is completely covered by a peak in at least one experiment, as ChIP-seq peaks are much longer than TFBSs. Otherwise, a hit is marked a false positive.
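The labeling rule for hits can be sketched as follows. The data layout is an assumption made for illustration: hits and peaks are (chromosome, start, end) tuples in the same coordinate system, and the peaks of a TF are grouped by experiment name in a dict. The names `covered_by_peak` and `label_hits` are hypothetical.

```python
def covered_by_peak(hit, peaks):
    """True if the hit lies completely inside at least one peak."""
    chrom, start, end = hit
    return any(
        peak_chrom == chrom and peak_start <= start and end <= peak_end
        for peak_chrom, peak_start, peak_end in peaks
    )


def label_hits(hits, experiments):
    """Return one boolean per hit: True for a true positive, False otherwise.

    experiments maps an experiment (track) name to its list of peaks; a hit is
    a true positive if a peak from any single experiment covers it completely.
    """
    return [
        any(covered_by_peak(hit, peaks) for peaks in experiments.values())
        for hit in hits
    ]


# Hypothetical peaks from two experiments and four hits.
EXPERIMENTS = {"rep1": [("chr6", 100, 400)], "rep2": [("chr6", 500, 600)]}
HITS = [("chr6", 120, 130), ("chr6", 395, 405), ("chr1", 120, 130), ("chr6", 510, 520)]
print(label_hits(HITS, EXPERIMENTS))  # [True, False, False, True]
```

The second hit straddles the end of a peak and is therefore counted as a false positive under the complete-coverage rule.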


Figure 2.3.1: Input Page of LASAGNA-Search

User interface of the LASAGNA-Search input page. (A) The input page when the “Enter known TFBSs” radio button is checked. (B) Selecting TF models by keyword search. The models matching keywords p53 and STAT3 are listed in the table. (C) Manual trimming of TFBS alignments. Automatically trimmed alignment of 11 NF-Y binding sites is presented as a PWM and a sequence logo for manual trimming. Users may choose to further trim the alignment or to recover previously trimmed columns by clicking the 4 buttons on the bottom. (D) Selecting promoters by keyword search. Promoters found by the keywords CCND1 and MYB are shown in the table. Only promoters belonging to Homo sapiens are shown.

[Screenshot of the result page with the tabs “All results”, “Promoter view”, “Gene regulatory network”, “Import results” and “Inputs”, listing the transcripts and hits for the human MYB promoter (-950 to +50).]

Figure 2.3.2: Result Page of LASAGNA-Search

The LASAGNA-Search result page with the “Promoter view” tab showing. Users may examine the hits on individual promoter sequences by clicking the respective title bars. The content for the MYB promoter is shown. The first table lists the transcripts corresponding to this sequence. The second table displays the hits found on this promoter sequence.

[UCSC Genome Browser screenshot (hg19, chr6:135,501,800-135,502,600) showing the RefSeq Genes track (MYB), the LASAGNA-Search Predicted TFBSs track, the H1-hESC and K562 H3K4me3 ChIP-seq signal tracks from ENCODE/Broad, and the GERP scores for mammalian alignments.]

Figure 2.3.3: Visualization of Hits in the UCSC Genome Browser

The hits were produced by scanning the 950 bp upstream to 50 bp downstream promoter region of human gene MYB with 27 TF models. Only significant hits (p-value ≤ 0.001) were retained. The hits in the GFF format are displayed in pack mode with 4 other tracks. Hits predicted by the same TF model are connected by a line. The RefSeq Genes track shows the relative hit positions to gene MYB. Two histone methylation tracks and the GERP (Genomic Evolutionary Rate Profiling) track [16] are also shown. The GERP score is a measure of evolutionary conservation and TFs are known to preferentially bind motif instances in conserved regions [14]. Histone methylation has been shown to be the most important factor in predicting the general binding preference of TFs [21].

Table 2.3.10

| Feature | LASAGNA-Search | matrix-scan | MAPPER2 |
|---|---|---|---|
| TF Model | | | |
| User PWM | Yes | Yes | No |
| User TFBSs | Yes, unaligned, variable-length TFBSs. | No | Yes, aligned TFBSs. |
| Model collection | Yes, 1792 TFBS-based and PWM-based models. | Not available | Yes, 1017 TFBS-based models. |
| Promoter Sequence | | | |
| Format | FASTA | FASTA and 5 other formats. | FASTA |
| Retrieval tool | Yes, built-in for 15 species. | Yes, comprehensive species coverage by tools retrieve sequence and retrieve EnsEMBL sequence on the same website. | Yes, built-in for 3 species. |
| Search Result | | | |
| Filtering | p-value | p-value | E-value |
| Local visualization | Yes | Yes | Yes |
| Visualization in UCSC Genome Browser | Yes, supports 15 species. | Limited. The build, coordinates and orientation must be specified in the FASTA sequence header. | Yes, supports 3 species. |
| GRN inference | Yes | No | No |

Table 2.3.11: Human ENCODE Tracks Used for Validating TF Models

TF Antibody Track

ATF1 Atf106325 wgEncodeSydhTfbsK562Atf106325StdPk

ATF2 Atf2sc81188 wgEncodeHaibTfbsGm12878Atf2sc81188V0422111PkRep1

wgEncodeHaibTfbsGm12878Atf2sc81188V0422111PkRep2

wgEncodeHaibTfbsH1hescAtf2sc81188V0422111PkRep1

wgEncodeHaibTfbsH1hescAtf2sc81188V0422111PkRep2

ATF3 Atf3 wgEncodeHaibTfbsA549Atf3V0422111Etoh02PkRep1

wgEncodeHaibTfbsA549Atf3V0422111Etoh02PkRep2

wgEncodeHaibTfbsGm12878Atf3Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Atf3Pcr1xPkRep2

wgEncodeHaibTfbsH1hescAtf3V0416102PkRep1

wgEncodeHaibTfbsH1hescAtf3V0416102PkRep2

wgEncodeHaibTfbsHct116Atf3V0422111PkRep1

wgEncodeHaibTfbsHct116Atf3V0422111PkRep2

wgEncodeHaibTfbsHepg2Atf3V0416101PkRep1

wgEncodeHaibTfbsHepg2Atf3V0416101PkRep2

wgEncodeHaibTfbsK562Atf3V0416101PkRep1

wgEncodeHaibTfbsK562Atf3V0416101PkRep2

wgEncodeSydhTfbsK562Atf3StdPk

CEBPB Cebpb wgEncodeSydhTfbsA549CebpbIggrabPk

wgEncodeSydhTfbsH1hescCebpbIggrabPk

wgEncodeSydhTfbsHelas3CebpbIggrabPk

wgEncodeSydhTfbsHepg2CebpbForsklnStdPk

wgEncodeSydhTfbsHepg2CebpbIggrabPk

wgEncodeSydhTfbsImr90CebpbIggrabPk

wgEncodeSydhTfbsK562CebpbIggrabPk

Cebpbsc150 wgEncodeHaibTfbsA549Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsA549Cebpbsc150V0422111PkRep2

wgEncodeHaibTfbsEcc1Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsEcc1Cebpbsc150V0422111PkRep2

wgEncodeHaibTfbsGm12878Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsGm12878Cebpbsc150V0422111PkRep2

wgEncodeHaibTfbsHct116Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsHct116Cebpbsc150V0422111PkRep2

wgEncodeHaibTfbsHepg2Cebpbsc150V0416101PkRep1

wgEncodeHaibTfbsHepg2Cebpbsc150V0416101PkRep2

wgEncodeHaibTfbsK562Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsK562Cebpbsc150V0422111PkRep2

wgEncodeHaibTfbsMcf7Cebpbsc150V0422111PkRep1

wgEncodeHaibTfbsMcf7Cebpbsc150V0422111PkRep2

CREB1 Creb1sc240 wgEncodeHaibTfbsA549Creb1sc240V0416102Dex100nmPkRep1

wgEncodeHaibTfbsA549Creb1sc240V0416102Dex100nmPkRep2

wgEncodeHaibTfbsEcc1Creb1sc240V0422111PkRep1

wgEncodeHaibTfbsEcc1Creb1sc240V0422111PkRep2

wgEncodeHaibTfbsGm12878Creb1sc240V0422111PkRep1

wgEncodeHaibTfbsGm12878Creb1sc240V0422111PkRep2

wgEncodeHaibTfbsH1hescCreb1sc240V0422111PkRep1

wgEncodeHaibTfbsH1hescCreb1sc240V0422111PkRep2

wgEncodeHaibTfbsHepg2Creb1sc240V0422111PkRep1

wgEncodeHaibTfbsHepg2Creb1sc240V0422111PkRep2

wgEncodeHaibTfbsK562Creb1sc240V0422111PkRep1

wgEncodeHaibTfbsK562Creb1sc240V0422111PkRep2

CUX1 Cdpsc6327 wgEncodeSydhTfbsGm12878Cdpsc6327IggmusPk

wgEncodeSydhTfbsK562Cdpsc6327IggrabPk

E2F1 E2f1 wgEncodeSydhTfbsHelas3E2f1StdPk

Hae2f1 wgEncodeSydhTfbsHelas3Hae2f1StdPk

wgEncodeSydhTfbsMcf7Hae2f1UcdPk

E2F4 E2f4 wgEncodeSydhTfbsGm12878E2f4IggmusPk

wgEncodeSydhTfbsHelas3E2f4StdPk

wgEncodeSydhTfbsK562E2f4UcdPk

wgEncodeSydhTfbsMcf10aesE2f4TamHvdPk

ELK1 Elk112771 wgEncodeSydhTfbsGm12878Elk112771IggmusPk

wgEncodeSydhTfbsHelas3Elk112771IggrabPk

wgEncodeSydhTfbsK562Elk112771IggrabPk

ELK4 Elk4 wgEncodeSydhTfbsHek293Elk4UcdPk

wgEncodeSydhTfbsHelas3Elk4UcdPk

EP300 P300 wgEncodeHaibTfbsA549P300V0422111Etoh02PkRep1

wgEncodeHaibTfbsA549P300V0422111Etoh02PkRep2

wgEncodeHaibTfbsEcc1P300V0422111PkRep1

wgEncodeHaibTfbsEcc1P300V0422111PkRep2

wgEncodeHaibTfbsGm12878P300Pcr1xPkRep1

wgEncodeHaibTfbsGm12878P300Pcr1xPkRep2

wgEncodeHaibTfbsH1hescP300V0416102PkRep1

wgEncodeHaibTfbsH1hescP300V0416102PkRep2

wgEncodeHaibTfbsHepg2P300V0416101PkRep1

wgEncodeHaibTfbsHepg2P300V0416101PkRep2

wgEncodeHaibTfbsMcf7P300V0422111PkRep1

wgEncodeHaibTfbsMcf7P300V0422111PkRep2

wgEncodeHaibTfbsSknshP300V0422111PkRep1

wgEncodeHaibTfbsSknshP300V0422111PkRep2

wgEncodeHaibTfbsSknshraP300V0416102PkRep1

wgEncodeHaibTfbsSknshraP300V0416102PkRep2

wgEncodeHaibTfbsT47dP300V0416102Dm002p1hPkRep1

wgEncodeHaibTfbsT47dP300V0416102Dm002p1hPkRep2

wgEncodeSydhTfbsGm12878P300IggmusPk

wgEncodeSydhTfbsK562P300IggrabPk

P300sc582 wgEncodeSydhTfbsHepg2P300sc582IggrabPk

P300sc584 wgEncodeSydhTfbsGm12878P300sc584IggmusPk

P300sc584sc48343 wgEncodeSydhTfbsK562P300sc584sc48343IggrabPk

P300sc584sc584 wgEncodeSydhTfbsHelas3P300sc584sc584IggrabPk

ESR1 Eraa wgEncodeHaibTfbsEcc1EraaV0416102Bpa1hPkRep1

wgEncodeHaibTfbsEcc1EraaV0416102Bpa1hPkRep2

wgEncodeHaibTfbsT47dEraaV0416102Bpa1hPkRep1

wgEncodeHaibTfbsT47dEraaV0416102Bpa1hPkRep2

Eralphaa wgEncodeHaibTfbsEcc1EralphaaV0416102Est10nm1hPkRep1

wgEncodeHaibTfbsEcc1EralphaaV0416102Est10nm1hPkRep2

wgEncodeHaibTfbsEcc1EralphaaV0416102Gen1hPkRep1

wgEncodeHaibTfbsEcc1EralphaaV0416102Gen1hPkRep2

wgEncodeHaibTfbsT47dEralphaaPcr2xGen1hPkRep1

wgEncodeHaibTfbsT47dEralphaaPcr2xGen1hPkRep2

wgEncodeHaibTfbsT47dEralphaaV0416102Est10nm1hPkRep1

wgEncodeHaibTfbsT47dEralphaaV0416102Est10nm1hPkRep2

ETS1 Ets1 wgEncodeHaibTfbsA549Ets1V0422111Etoh02PkRep1

wgEncodeHaibTfbsA549Ets1V0422111Etoh02PkRep2

wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep1V2

wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep2

wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep2V2

wgEncodeHaibTfbsK562Ets1V0416101PkRep1

wgEncodeHaibTfbsK562Ets1V0416101PkRep2

FOS Cfos wgEncodeSydhTfbsGm12878CfosStdPk

wgEncodeSydhTfbsHelas3CfosStdPk

wgEncodeSydhTfbsHuvecCfosUcdPk

wgEncodeSydhTfbsK562CfosStdPk

wgEncodeSydhTfbsMcf10aesCfosEtoh01HvdPk

wgEncodeSydhTfbsMcf10aesCfosTam112hHvdPk

wgEncodeSydhTfbsMcf10aesCfosTam14hHvdPk

wgEncodeSydhTfbsMcf10aesCfosTamHvdPk

Efos wgEncodeUchicagoTfbsK562EfosControlPk

FOSL1 Fosl1 wgEncodeHaibTfbsHct116Fosl1V0422111PkRep1

wgEncodeHaibTfbsHct116Fosl1V0422111PkRep2

Fosl1sc183 wgEncodeHaibTfbsH1hescFosl1sc183V0416102PkRep1

wgEncodeHaibTfbsH1hescFosl1sc183V0416102PkRep2

wgEncodeHaibTfbsK562Fosl1sc183V0416101PkRep1

wgEncodeHaibTfbsK562Fosl1sc183V0416101PkRep2

FOXM1 Foxm1sc502 wgEncodeHaibTfbsEcc1Foxm1sc502V0422111PkRep1

wgEncodeHaibTfbsEcc1Foxm1sc502V0422111PkRep2

wgEncodeHaibTfbsGm12878Foxm1sc502V0422111PkRep1

wgEncodeHaibTfbsGm12878Foxm1sc502V0422111PkRep2

wgEncodeHaibTfbsMcf7Foxm1sc502V0422111PkRep1

wgEncodeHaibTfbsMcf7Foxm1sc502V0422111PkRep2

wgEncodeHaibTfbsSknshFoxm1sc502V0422111PkRep1

wgEncodeHaibTfbsSknshFoxm1sc502V0422111PkRep2

GABPA Gabp wgEncodeHaibTfbsA549GabpV0422111Etoh02PkRep1

wgEncodeHaibTfbsA549GabpV0422111Etoh02PkRep2

wgEncodeHaibTfbsGm12878GabpPcr2xPkRep1

wgEncodeHaibTfbsGm12878GabpPcr2xPkRep2

wgEncodeHaibTfbsH1hescGabpPcr1xPkRep1

wgEncodeHaibTfbsH1hescGabpPcr1xPkRep2

wgEncodeHaibTfbsHelas3GabpPcr1xPkRep1

wgEncodeHaibTfbsHelas3GabpPcr1xPkRep2

wgEncodeHaibTfbsHepg2GabpPcr2xPkRep1

wgEncodeHaibTfbsHepg2GabpPcr2xPkRep2

wgEncodeHaibTfbsHl60GabpV0422111PkRep1

wgEncodeHaibTfbsHl60GabpV0422111PkRep2

wgEncodeHaibTfbsK562GabpV0416101PkRep1

wgEncodeHaibTfbsK562GabpV0416101PkRep2

wgEncodeHaibTfbsMcf7GabpV0422111PkRep1

wgEncodeHaibTfbsMcf7GabpV0422111PkRep2

wgEncodeHaibTfbsSknshGabpV0422111PkRep1

wgEncodeHaibTfbsSknshGabpV0422111PkRep2

GATA3 Gata3 wgEncodeHaibTfbsA549Gata3V0422111PkRep1

wgEncodeHaibTfbsA549Gata3V0422111PkRep2

wgEncodeHaibTfbsMcf7Gata3V0422111PkRep1

wgEncodeHaibTfbsMcf7Gata3V0422111PkRep2

wgEncodeHaibTfbsSknshGata3V0422111PkRep1

wgEncodeHaibTfbsSknshGata3V0422111PkRep2

wgEncodeSydhTfbsMcf7Gata3UcdPk

Gata3sc268 wgEncodeHaibTfbsT47dGata3sc268V0416102Dm002p1hPkRep1

wgEncodeHaibTfbsT47dGata3sc268V0416102Dm002p1hPkRep2

Gata3sc269 wgEncodeSydhTfbsMcf7Gata3sc269UcdPk

Gata3sc269sc269 wgEncodeSydhTfbsShsy5yGata3sc269sc269UcdPk

HNF4A Hnf4a wgEncodeSydhTfbsHepg2Hnf4aForsklnStdPk

Hnf4asc8987 wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101PkRep1

wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101PkRep2

HSF1 Hsf1 wgEncodeSydhTfbsHepg2Hsf1ForsklnStdPk

IRF1 Irf1 wgEncodeSydhTfbsK562Irf1Ifna30StdPk

wgEncodeSydhTfbsK562Irf1Ifna6hStdPk

wgEncodeSydhTfbsK562Irf1Ifng30StdPk

wgEncodeSydhTfbsK562Irf1Ifng6hStdPk

JUN Cjun wgEncodeSydhTfbsH1hescCjunIggrabPk

wgEncodeSydhTfbsHelas3CjunIggrabPk

wgEncodeSydhTfbsHepg2CjunIggrabPk

wgEncodeSydhTfbsHuvecCjunStdPk

wgEncodeSydhTfbsK562CjunIfna30StdPk

wgEncodeSydhTfbsK562CjunIfna6hStdPk

wgEncodeSydhTfbsK562CjunIfng30StdPk

wgEncodeSydhTfbsK562CjunIfng6hStdPk

wgEncodeSydhTfbsK562CjunIggrabPk

wgEncodeSydhTfbsK562CjunStdPk

JUNB Ejunb wgEncodeUchicagoTfbsK562EjunbControlPk

JUND Ejund wgEncodeUchicagoTfbsK562EjundControlPk

Jund wgEncodeHaibTfbsA549JundV0416102Etoh02PkRep1

wgEncodeHaibTfbsA549JundV0416102Etoh02PkRep2

wgEncodeHaibTfbsH1hescJundV0416102PkRep1

wgEncodeHaibTfbsH1hescJundV0416102PkRep2

wgEncodeHaibTfbsHct116JundV0422111PkRep1

wgEncodeHaibTfbsHct116JundV0422111PkRep2

wgEncodeHaibTfbsHepg2JundPcr1xPkRep1

wgEncodeHaibTfbsHepg2JundPcr1xPkRep2

wgEncodeHaibTfbsMcf7JundV0422111PkRep1

wgEncodeHaibTfbsMcf7JundV0422111PkRep2

wgEncodeHaibTfbsSknshJundV0422111PkRep1

wgEncodeHaibTfbsSknshJundV0422111PkRep2

wgEncodeHaibTfbsT47dJundV0422111PkRep1

wgEncodeHaibTfbsT47dJundV0422111PkRep2

wgEncodeSydhTfbsGm12878JundIggrabPk

wgEncodeSydhTfbsGm12878JundStdPk

wgEncodeSydhTfbsH1hescJundIggrabPk

wgEncodeSydhTfbsHelas3JundIggrabPk

wgEncodeSydhTfbsHepg2JundIggrabPk

wgEncodeSydhTfbsK562JundIggrabPk

wgEncodeSydhTfbsSknshJundIggrabPk

MAX Max wgEncodeHaibTfbsA549MaxV0422111PkRep1

wgEncodeHaibTfbsA549MaxV0422111PkRep2

wgEncodeHaibTfbsEcc1MaxV0422111PkRep1

wgEncodeHaibTfbsEcc1MaxV0422111PkRep2

wgEncodeHaibTfbsH1hescMaxV0422111PkRep1

wgEncodeHaibTfbsH1hescMaxV0422111PkRep2

wgEncodeHaibTfbsHct116MaxV0422111PkRep1

wgEncodeHaibTfbsHct116MaxV0422111PkRep2

wgEncodeHaibTfbsHepg2MaxV0422111PkRep1

wgEncodeHaibTfbsHepg2MaxV0422111PkRep2

wgEncodeHaibTfbsK562MaxV0416102PkRep1

wgEncodeHaibTfbsK562MaxV0416102PkRep2

wgEncodeHaibTfbsMcf7MaxV0422111PkRep1

wgEncodeHaibTfbsMcf7MaxV0422111PkRep2

wgEncodeHaibTfbsSknshMaxV0422111PkRep1

wgEncodeHaibTfbsSknshMaxV0422111PkRep2

wgEncodeSydhTfbsA549MaxIggrabPk

wgEncodeSydhTfbsGm12878MaxIggmusPk

wgEncodeSydhTfbsGm12878MaxStdPk

wgEncodeSydhTfbsH1hescMaxUcdPk

wgEncodeSydhTfbsHelas3MaxIggrabPk

wgEncodeSydhTfbsHelas3MaxStdPk

wgEncodeSydhTfbsHepg2MaxIggrabPk

wgEncodeSydhTfbsHuvecMaxStdPk

wgEncodeSydhTfbsK562MaxIggrabPk

wgEncodeSydhTfbsK562MaxStdPk

wgEncodeSydhTfbsNb4MaxStdPk

MEF2A Mef2a wgEncodeHaibTfbsGm12878Mef2aPcr1xPkRep1

wgEncodeHaibTfbsGm12878Mef2aPcr1xPkRep2

wgEncodeHaibTfbsK562Mef2aV0416101PkRep1

wgEncodeHaibTfbsK562Mef2aV0416101PkRep2

wgEncodeHaibTfbsSknshMef2aV0422111PkRep1

wgEncodeHaibTfbsSknshMef2aV0422111PkRep2

MYC Cmyc wgEncodeSydhTfbsA549CmycIggrabPk

wgEncodeSydhTfbsH1hescCmycIggrabPk

wgEncodeSydhTfbsHelas3CmycStdPk

wgEncodeSydhTfbsK562CmycIfna30StdPk

wgEncodeSydhTfbsK562CmycIfna6hStdPk

wgEncodeSydhTfbsK562CmycIfng30StdPk

wgEncodeSydhTfbsK562CmycIfng6hStdPk

wgEncodeSydhTfbsK562CmycIggrabPk

wgEncodeSydhTfbsK562CmycStdPk

wgEncodeSydhTfbsMcf10aesCmycEtoh01HvdPk

wgEncodeSydhTfbsMcf10aesCmycTam14hHvdPk

wgEncodeSydhTfbsNb4CmycStdPk

NFATC1 Nfatc1sc17834 wgEncodeHaibTfbsGm12878Nfatc1sc17834V0422111PkRep1

wgEncodeHaibTfbsGm12878Nfatc1sc17834V0422111PkRep2

NFE2 Nfe2 wgEncodeSydhTfbsK562Nfe2StdPk

Nfe2sc22827 wgEncodeSydhTfbsGm12878Nfe2sc22827StdPk

NFIC Nficsc81335 wgEncodeHaibTfbsEcc1Nficsc81335V0422111PkRep1

wgEncodeHaibTfbsEcc1Nficsc81335V0422111PkRep2

wgEncodeHaibTfbsGm12878Nficsc81335V0422111PkRep1

wgEncodeHaibTfbsGm12878Nficsc81335V0422111PkRep2

wgEncodeHaibTfbsHepg2Nficsc81335V0422111PkRep1

wgEncodeHaibTfbsHepg2Nficsc81335V0422111PkRep2

wgEncodeHaibTfbsSknshNficsc81335V0422111PkRep1

wgEncodeHaibTfbsSknshNficsc81335V0422111PkRep2

NFYA Nfya wgEncodeSydhTfbsGm12878NfyaIggmusPk

wgEncodeSydhTfbsHelas3NfyaIggrabPk

wgEncodeSydhTfbsK562NfyaStdPk

NFYB Nfyb wgEncodeSydhTfbsGm12878NfybIggmusPk

wgEncodeSydhTfbsHelas3NfybIggrabPk

wgEncodeSydhTfbsK562NfybStdPk

NR2F2 Nr2f2sc271940 wgEncodeHaibTfbsHepg2Nr2f2sc271940V0422111PkRep1

wgEncodeHaibTfbsHepg2Nr2f2sc271940V0422111PkRep2

wgEncodeHaibTfbsK562Nr2f2sc271940V0422111PkRep1

wgEncodeHaibTfbsK562Nr2f2sc271940V0422111PkRep2

wgEncodeHaibTfbsMcf7Nr2f2sc271940V0422111PkRep1

wgEncodeHaibTfbsMcf7Nr2f2sc271940V0422111PkRep2

NR3C1 Gr wgEncodeHaibTfbsA549GrPcr1xDex500pmPkRep1

wgEncodeHaibTfbsA549GrPcr1xDex500pmPkRep2

wgEncodeHaibTfbsA549GrPcr1xDex50nmPkRep1

wgEncodeHaibTfbsA549GrPcr1xDex50nmPkRep2

wgEncodeHaibTfbsA549GrPcr1xDex5nmPkRep1

wgEncodeHaibTfbsA549GrPcr1xDex5nmPkRep2

wgEncodeHaibTfbsA549GrPcr2xDex100nmPkRep1

wgEncodeHaibTfbsA549GrPcr2xDex100nmPkRep2

wgEncodeHaibTfbsEcc1GrV0416102Dex100nmPkRep1

wgEncodeHaibTfbsEcc1GrV0416102Dex100nmPkRep2

Grp20 wgEncodeSydhTfbsHepg2Grp20ForsklnStdPk

PAX5 Pax5c20 wgEncodeHaibTfbsGm12878Pax5c20Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Pax5c20Pcr1xPkRep2

wgEncodeHaibTfbsGm12891Pax5c20V0416101PkRep1

wgEncodeHaibTfbsGm12891Pax5c20V0416101PkRep2

wgEncodeHaibTfbsGm12892Pax5c20V0416101PkRep1

wgEncodeHaibTfbsGm12892Pax5c20V0416101PkRep2

Pax5n19 wgEncodeHaibTfbsGm12878Pax5n19Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Pax5n19Pcr1xPkRep2

POU2F2 Pou2f2 wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep2

wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep3

wgEncodeHaibTfbsGm12891Pou2f2Pcr1xPkRep1

wgEncodeHaibTfbsGm12891Pou2f2Pcr1xPkRep2

RELA Nfkb wgEncodeSydhTfbsGm10847NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm12878NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm12891NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm12892NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm15510NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm18505NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm18526NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm18951NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm19099NfkbTnfaIggrabPk

wgEncodeSydhTfbsGm19193NfkbTnfaIggrabPk

RXRA Rxra wgEncodeHaibTfbsGm12878RxraPcr1xPkRep1

wgEncodeHaibTfbsGm12878RxraPcr1xPkRep2

wgEncodeHaibTfbsH1hescRxraV0416102PkRep1

wgEncodeHaibTfbsH1hescRxraV0416102PkRep2

wgEncodeHaibTfbsHepg2RxraPcr1xPkRep1

wgEncodeHaibTfbsHepg2RxraPcr1xPkRep2

wgEncodeHaibTfbsSknshRxraV0422111PkRep1

wgEncodeHaibTfbsSknshRxraV0422111PkRep2

SP1 Sp1 wgEncodeHaibTfbsA549Sp1V0422111Etoh02PkRep1

wgEncodeHaibTfbsA549Sp1V0422111Etoh02PkRep2

wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2

wgEncodeHaibTfbsH1hescSp1Pcr1xPkRep1

wgEncodeHaibTfbsH1hescSp1Pcr1xPkRep2

wgEncodeHaibTfbsHct116Sp1V0422111PkRep1

wgEncodeHaibTfbsHct116Sp1V0422111PkRep2

wgEncodeHaibTfbsHepg2Sp1Pcr1xPkRep1

wgEncodeHaibTfbsHepg2Sp1Pcr1xPkRep2

wgEncodeHaibTfbsK562Sp1Pcr1xPkRep1

wgEncodeHaibTfbsK562Sp1Pcr1xPkRep2

SPI1 Pu1 wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep2

wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep3

wgEncodeHaibTfbsGm12891Pu1Pcr1xPkRep1

wgEncodeHaibTfbsGm12891Pu1Pcr1xPkRep2

wgEncodeHaibTfbsHl60Pu1V0422111PkRep1

wgEncodeHaibTfbsHl60Pu1V0422111PkRep2

wgEncodeHaibTfbsK562Pu1Pcr1xPkRep1

wgEncodeHaibTfbsK562Pu1Pcr1xPkRep2

SREBF1 Srebp1 wgEncodeSydhTfbsGm12878Srebp1IggrabPk

wgEncodeSydhTfbsHepg2Srebp1InslnStdPk

wgEncodeSydhTfbsHepg2Srebp1PravastStdPk

SRF Srf wgEncodeHaibTfbsEcc1SrfV0422111PkRep1

wgEncodeHaibTfbsEcc1SrfV0422111PkRep2

wgEncodeHaibTfbsGm12878SrfPcr2xPkRep1

wgEncodeHaibTfbsGm12878SrfPcr2xPkRep2

wgEncodeHaibTfbsGm12878SrfV0416101PkRep1

wgEncodeHaibTfbsGm12878SrfV0416101PkRep2

wgEncodeHaibTfbsH1hescSrfPcr1xPkRep1

wgEncodeHaibTfbsH1hescSrfPcr1xPkRep2

wgEncodeHaibTfbsHct116SrfV0422111PkRep1

wgEncodeHaibTfbsHct116SrfV0422111PkRep2

wgEncodeHaibTfbsHepg2SrfV0416101PkRep1

wgEncodeHaibTfbsHepg2SrfV0416101PkRep2

wgEncodeHaibTfbsK562SrfV0416101PkRep1

wgEncodeHaibTfbsK562SrfV0416101PkRep2

wgEncodeHaibTfbsMcf7SrfV0422111PkRep1

wgEncodeHaibTfbsMcf7SrfV0422111PkRep2

STAT1 Stat1 wgEncodeSydhTfbsGm12878Stat1StdPk

wgEncodeSydhTfbsHelas3Stat1Ifng30StdPk

wgEncodeSydhTfbsK562Stat1Ifna30StdPk

wgEncodeSydhTfbsK562Stat1Ifna6hStdPk

wgEncodeSydhTfbsK562Stat1Ifng30StdPk

wgEncodeSydhTfbsK562Stat1Ifng6hStdPk

STAT2 Stat2 wgEncodeSydhTfbsK562Stat2Ifna30StdPk

wgEncodeSydhTfbsK562Stat2Ifna6hStdPk

STAT3 Stat3 wgEncodeSydhTfbsGm12878Stat3IggmusPk

wgEncodeSydhTfbsHelas3Stat3IggrabPk

wgEncodeSydhTfbsMcf10aesStat3Etoh01StdPk

wgEncodeSydhTfbsMcf10aesStat3Etoh01bStdPk

wgEncodeSydhTfbsMcf10aesStat3Etoh01cStdPk

wgEncodeSydhTfbsMcf10aesStat3Tam112hHvdPk

wgEncodeSydhTfbsMcf10aesStat3TamStdPk

STAT5A Stat5asc74442 wgEncodeHaibTfbsGm12878Stat5asc74442V0422111PkRep1

wgEncodeHaibTfbsGm12878Stat5asc74442V0422111PkRep2

wgEncodeHaibTfbsK562Stat5asc74442V0422111PkRep1

wgEncodeHaibTfbsK562Stat5asc74442V0422111PkRep2

TBP Tbp wgEncodeSydhTfbsGm12878TbpIggmusPk

wgEncodeSydhTfbsH1hescTbpIggrabPk

wgEncodeSydhTfbsHelas3TbpIggrabPk

wgEncodeSydhTfbsHepg2TbpIggrabPk

wgEncodeSydhTfbsK562TbpIggmusPk

TCF3 Tcf3 wgEncodeHaibTfbsGm12878Tcf3Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Tcf3Pcr1xPkRep2

USF1 Usf1 wgEncodeHaibTfbsA549Usf1Pcr1xDex100nmPkRep1

wgEncodeHaibTfbsA549Usf1Pcr1xDex100nmPkRep2

wgEncodeHaibTfbsA549Usf1Pcr1xEtoh02PkRep1

wgEncodeHaibTfbsA549Usf1Pcr1xEtoh02PkRep2

wgEncodeHaibTfbsA549Usf1V0422111Etoh02PkRep1

wgEncodeHaibTfbsA549Usf1V0422111Etoh02PkRep2

wgEncodeHaibTfbsEcc1Usf1V0422111PkRep1

wgEncodeHaibTfbsEcc1Usf1V0422111PkRep2

wgEncodeHaibTfbsGm12878Usf1Pcr2xPkRep1

wgEncodeHaibTfbsGm12878Usf1Pcr2xPkRep2

wgEncodeHaibTfbsH1hescUsf1Pcr1xPkRep1

wgEncodeHaibTfbsH1hescUsf1Pcr1xPkRep2

wgEncodeHaibTfbsHct116Usf1V0422111PkRep1

wgEncodeHaibTfbsHct116Usf1V0422111PkRep2

wgEncodeHaibTfbsHepg2Usf1Pcr1xPkRep1

wgEncodeHaibTfbsHepg2Usf1Pcr1xPkRep2

wgEncodeHaibTfbsK562Usf1V0416101PkRep1

wgEncodeHaibTfbsK562Usf1V0416101PkRep2

wgEncodeHaibTfbsSknshUsf1V0422111PkRep1

wgEncodeHaibTfbsSknshUsf1V0422111PkRep2

Usf1sc8983 wgEncodeHaibTfbsSknshraUsf1sc8983V0416102PkRep1

wgEncodeHaibTfbsSknshraUsf1sc8983V0416102PkRep2

YY1 Yy1 wgEncodeHaibTfbsGm12892Yy1V0416101PkRep1

wgEncodeHaibTfbsGm12892Yy1V0416101PkRep2

wgEncodeHaibTfbsK562Yy1V0416101PkRep1

wgEncodeHaibTfbsK562Yy1V0416101PkRep2

wgEncodeHaibTfbsK562Yy1V0416102PkRep1

wgEncodeHaibTfbsK562Yy1V0416102PkRep2

wgEncodeSydhTfbsGm12878Yy1StdPk

wgEncodeSydhTfbsK562Yy1UcdPk

wgEncodeSydhTfbsNt2d1Yy1UcdPk

Yy1sc281 wgEncodeHaibTfbsEcc1Yy1sc281V0422111PkRep1

wgEncodeHaibTfbsEcc1Yy1sc281V0422111PkRep2

wgEncodeHaibTfbsGm12878Yy1sc281Pcr1xPkRep1

wgEncodeHaibTfbsGm12878Yy1sc281Pcr1xPkRep2

wgEncodeHaibTfbsGm12891Yy1sc281V0416101PkRep1

wgEncodeHaibTfbsGm12891Yy1sc281V0416101PkRep2

wgEncodeHaibTfbsH1hescYy1sc281V0416102PkRep1

wgEncodeHaibTfbsH1hescYy1sc281V0416102PkRep2

wgEncodeHaibTfbsHct116Yy1sc281V0416101PkRep1

wgEncodeHaibTfbsHct116Yy1sc281V0416101PkRep2

wgEncodeHaibTfbsHepg2Yy1sc281V0416101PkRep1

wgEncodeHaibTfbsHepg2Yy1sc281V0416101PkRep2

wgEncodeHaibTfbsK562Yy1sc281V0416101PkRep1

wgEncodeHaibTfbsK562Yy1sc281V0416101PkRep2

wgEncodeHaibTfbsSknshYy1sc281V0422111PkRep1

wgEncodeHaibTfbsSknshYy1sc281V0422111PkRep2

wgEncodeHaibTfbsSknshraYy1sc281V0416102PkRep1

wgEncodeHaibTfbsSknshraYy1sc281V0416102PkRep2

ZEB1 Zeb1 wgEncodeHaibTfbsHepg2Zeb1V0422111PkRep1

wgEncodeHaibTfbsHepg2Zeb1V0422111PkRep2

Zeb1sc25388 wgEncodeHaibTfbsGm12878Zeb1sc25388V0416102PkRep1

wgEncodeHaibTfbsGm12878Zeb1sc25388V0416102PkRep2

Table 2.3.12: Mouse ENCODE Tracks Used for Validating TF Models

TF Antibody Track

Cebpb Cebpb wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1

wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2

wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1

Ets1 Ets1 wgEncodeSydhTfbsCh12Ets1IggrabPk

wgEncodeSydhTfbsMelEts1IggrabPk

Gata1 Gata1 wgEncodeSydhTfbsMelGata1Dm2p5dStdPk

wgEncodeSydhTfbsMelGata1IggratPk

Jun Cjun wgEncodeSydhTfbsCh12CjunIggrabPk

Jund Jund wgEncodeSydhTfbsCh12JundIggrabPk

wgEncodeSydhTfbsMelJundIggrabPk

Mafk Mafk wgEncodeSydhTfbsEse14MafkStdPk

wgEncodeSydhTfbsMelMafkDm2p5dStdPk

Mafkab50322 wgEncodeSydhTfbsCh12Mafkab50322IggrabPk

wgEncodeSydhTfbsMelMafkab50322IggrabPk

Myb Cmybh141 wgEncodeSydhTfbsMelCmybh141IggrabPk

Cmybsc7874 wgEncodeSydhTfbsMelCmybsc7874IggrabPk

Myod1 Sc32758 wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1

Myog Sc12732 wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bPcr2xPkRep1

wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1

Pax5 Pax5c wgEncodePsuTfbsCh12Pax5cFImmortal2a4bInputPk

Srf Srf wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1

Tcf3 Tcf3 wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1

Usf1 Usf1 wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1

wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1

Evaluating a model based on the top-1000 hits is analogous to evaluating a search engine based on the top-1000 documents. Therefore, we used average precision [84] to score a model. This performance measure is widely used in the information retrieval community and is defined as $\sum_{k=1}^{1000} P(k) \times tp(k) / c$. P(k) gives the precision based on the top k hits (the fraction of the top k hits that are true positives). The indicator tp(k) is 1 if hit k is a true positive; otherwise, tp(k) is 0. The denominator c is the portion of upstream regions covered by peaks in bases and was computed based on all the ChIP-seq experiments used to validate the model. We also scored each model by accuracy, which is equivalent to P(1000).
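As a minimal sketch of the two scores defined above, assuming the ranked hits of a model have already been labeled as true or false positives, the computation could look as follows. The function names, the `tp` list, and the example value of `c` are illustrative assumptions and are not taken from LASAGNA-Search or MAPPER2.

```python
from typing import Sequence


def average_precision(tp: Sequence[int], c: float) -> float:
    """Sum of P(k) * tp(k) over the ranked hits, divided by c.

    tp[k-1] is 1 if hit k is a true positive and 0 otherwise.
    c is the coverage-based denominator described in the text
    (here simply a positive number supplied by the caller).
    """
    if c <= 0:
        raise ValueError("c must be positive")
    true_positives_so_far = 0
    total = 0.0
    for k, is_tp in enumerate(tp, start=1):
        if is_tp:
            true_positives_so_far += 1
            # P(k) is the fraction of the top k hits that are true positives;
            # it only contributes when tp(k) = 1.
            total += true_positives_so_far / k
    return total / c


def accuracy(tp: Sequence[int]) -> float:
    """Precision over all supplied hits, i.e. P(n) for n = len(tp)."""
    if not tp:
        raise ValueError("tp must contain at least one hit")
    return sum(tp) / len(tp)


# Toy example with four hits instead of 1000 (assumed labels).
toy_tp = [1, 0, 1, 1]
toy_ap = average_precision(toy_tp, c=0.5)
toy_acc = accuracy(toy_tp)
```

With 1000 labeled hits, `accuracy` corresponds to P(1000). Because the sum is divided by c rather than by the number of true positives, the resulting score is not bounded by 1.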

The performance of LASAGNA-Search on a TF was measured by the average score of the associated models and likewise for MAPPER2. Results for LASAGNA-Search are listed in Tables 2.3.13 and 2.3.14, while results for MAPPER2 are listed in Tables 2.3.15 and 2.3.16. Average precision and accuracy are given in individual columns. Each row presents the performance of a model in predicting the binding sites of a TF. Figure 2.3.4 shows the comparison between LASAGNA-Search and MAPPER2 in terms of average precision. The same comparison in terms of accuracy is shown in Figure 2.3.5.
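The per-TF summary described above, where a tool's performance on a TF is the average score of its associated models, can be sketched as below. The function name and the `(tf, score)` row layout are assumptions made for illustration; the example scores are the two ATF1 average precision values and the ELK4 value listed in Table 2.3.13.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_per_tf(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the model scores that belong to the same TF."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for tf, score in rows:
        grouped[tf].append(score)
    return {tf: mean(scores) for tf, scores in grouped.items()}


# (TF, average precision) pairs taken from Table 2.3.13.
example_rows = [("ATF1", 5.973), ("ATF1", 4.895), ("ELK4", 6.723)]
per_tf = mean_score_per_tf(example_rows)
```

The same grouping applies to the accuracy column, so each tool ends up with one value per TF for each of the two measures.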

Table 2.3.13: Performance of LASAGNA-Search TF Models Validated by ENCODE Human ChIP-seq Data

TF Collection Model ID Model Name Accuracy Ave Prec1

ATF1 TRANSFAC (PWM-based) M00017 ATF 0.339 5.973

ATF1 TRANSFAC (TFBS-based) T00051 ATF 0.374 4.895

ATF3 TRANSFAC (PWM-based) M00017 ATF 0.279 2.554

ATF3 TRANSFAC (TFBS-based) T00051 ATF 0.302 2.184

CEBPB TRANSFAC (PWM-based) M00117 C/EBPbeta 0.484 2.079

CEBPB TRANSFAC (PWM-based) M00109 C/EBPbeta 0.364 1.196

CEBPB TRANSFAC (TFBS-based) T00581 NF-IL6-2 0.12 0.098

CREB1 TRANSFAC (PWM-based) M00178 CREB 0.499 3.007

CREB1 TRANSFAC (PWM-based) M00113 CREB 0.409 2.301

CREB1 TRANSFAC (PWM-based) M00115 Tax/CREB 0.095 2.301

CREB1 TRANSFAC (PWM-based) M00114 Tax/CREB 0.222 2.006

CREB1 TRANSFAC (PWM-based) M00177 CREB 0.443 1.925

CREB1 JASPAR CORE MA0018.1 CREB1 0.309 1.250

CREB1 TRANSFAC (PWM-based) M00039 CREB 0.482 1.019

CREB1 JASPAR CORE MA0018.2 CREB1 0.482 0.517

CREB1 TRANSFAC (TFBS-based) T00163 CREB 0.348 0.106

ATF2 TRANSFAC (PWM-based) M00179 CRE-BP1 0.399 3.058

ATF2 TRANSFAC (PWM-based) M00017 ATF 0.327 3.046

ATF2 TRANSFAC (TFBS-based) T00051 ATF 0.371 2.345

ATF2 TRANSFAC (TFBS-based) T00167 ATF-2 0.444 2.136

ATF2 TRANSFAC (PWM-based) M00041 CRE-BP1:c-Jun 0.379 1.822

ATF2 TRANSFAC (PWM-based) M00040 CRE-BP1 0.124 0.198

CUX1 TRANSFAC (PWM-based) M00095 CDP 0.054 0.883

CUX1 TRANSFAC (PWM-based) M00104 CDP CR1 0.193 0.526

CUX1 TRANSFAC (PWM-based) M00102 CDP 0.149 0.412

CUX1 TRANSFAC (PWM-based) M00105 CDP CR3 0.034 0.083

CUX1 TRANSFAC (PWM-based) M00106 CDP CR3+HD 0.151 0.028

E2F1 TRANSFAC (TFBS-based) T01542 E2F-1 0.324 2.167

E2F1 TRANSFAC (PWM-based) M00024 E2F 0.185 2.122

E2F1 TRANSFAC (TFBS-based) T00221 E2F 0.326 1.878

E2F1 TRANSFAC (PWM-based) M00516 E2F 0.312 1.493

E2F1 TRANSFAC (PWM-based) M00050 E2F 0.317 1.493

E2F1 JASPAR CORE MA0024.1 E2F1 0.317 0.758

E2F4 TRANSFAC (PWM-based) M00024 E2F 0.221 2.186

E2F4 TRANSFAC (TFBS-based) T00221 E2F 0.366 1.780

E2F4 TRANSFAC (PWM-based) M00516 E2F 0.33 1.295

E2F4 TRANSFAC (PWM-based) M00050 E2F 0.325 0.932

ELK1 JASPAR CORE MA0028.1 ELK1 0.174 2.392

ELK1 TRANSFAC (TFBS-based) T00250 Elk-1 0.152 0.946

ELK1 TRANSFAC (PWM-based) M00007 Elk-1 0.157 0.587

ELK1 TRANSFAC (PWM-based) M00025 Elk-1 0.346 0.501

ELK4 JASPAR CORE MA0076.1 ELK4 0.322 6.723

EP300 TRANSFAC (PWM-based) M00033 p300 0.15 0.150

ESR1 TRANSFAC (TFBS-based) T00261 ER-alpha 0.057 9.634

ESR1 JASPAR CORE MA0112.1 ESR1 0.148 3.845

ESR1 JASPAR CORE MA0112.2 ESR1 0.264 2.198

ESR1 TRANSFAC (PWM-based) M00191 ER 0.124 0.707

ETS1 JASPAR CORE MA0098.1 ETS1 0.06 0.063

FOXM1 ORegAnno ORegAnno 9606 32 FOXM1 (HNF3) 0.068 0.064

FOS TRANSFAC (TFBS-based) T00123 c-Fos 0.475 4.708

FOS UniPROBE UP00425.Jun+Fos Heterodimer.primary Jun/Fos Heterodimer 0.649 3.756

FOS TRANSFAC (PWM-based) M00188 AP-1 0.4 2.827

FOS TRANSFAC (TFBS-based) T00029 AP-1 0.502 2.807

FOS TRANSFAC (PWM-based) M00172 AP-1 0.379 2.720

FOS TRANSFAC (PWM-based) M00517 AP-1 0.603 2.603

FOS TRANSFAC (PWM-based) M00173 AP-1 0.377 1.814

FOS TRANSFAC (PWM-based) M00199 AP-1 0.509 1.664

FOS JASPAR CORE MA0099.2 AP1 0.365 1.497

FOS TRANSFAC (PWM-based) M00174 AP-1 0.52 1.399

GABPA JASPAR CORE MA0062.1 GABPA 0.539 3.685

GATA1 TRANSFAC (PWM-based) M00346 GATA-1 0.056 0.213

GATA1 TRANSFAC (PWM-based) M00126 GATA-1 0.05 0.158

GATA1 TRANSFAC (PWM-based) M00128 GATA-1 0.073 0.103

GATA1 TRANSFAC (PWM-based) M00203 GATA-X 0.091 0.092

GATA1 TRANSFAC (TFBS-based) T00306 GATA-1 0.064 0.072

GATA1 TRANSFAC (PWM-based) M00127 GATA-1 0.032 0.025

GATA3 TRANSFAC (PWM-based) M00203 GATA-X 0.104 0.275

GATA3 TRANSFAC (PWM-based) M00077 GATA-3 0.155 0.127

GATA3 TRANSFAC (TFBS-based) T00311 GATA-3 0.068 0.095

GATA3 JASPAR CORE MA0037.1 GATA3 0.095 0.063

NR3C1 JASPAR CORE MA0113.1 NR3C1 0.125 1.158

NR3C1 TRANSFAC (TFBS-based) T01920 GR-beta 0.147 1.158

NR3C1 TRANSFAC (TFBS-based) T00337 GR-alpha 0.147 1.007

NR3C1 TRANSFAC (PWM-based) M00192 GR 0.106 0.541

NR3C1 TRANSFAC (PWM-based) M00205 GR 0.093 0.409

HNF4A JASPAR CORE MA0114.1 HNF4A 0.216 2.664

HNF4A ORegAnno ORegAnno 9606 8 HNF4A — HNF4A (HNF4) 0.073 2.477

HNF4A TRANSFAC (PWM-based) M00158 COUP-TF, HNF-4 0.169 1.750

HNF4A TRANSFAC (PWM-based) M00134 HNF-4 0.222 0.490

HSF1 TRANSFAC (PWM-based) M00146 HSF1 0.005 0.045

IRF1 JASPAR CORE MA0050.1 IRF1 0.223 1.186

IRF1 TRANSFAC (TFBS-based) T00423 IRF-1 0.24 1.181

IRF1 TRANSFAC (PWM-based) M00062 IRF-1 0.149 0.250

JUN UniPROBE UP00425.Jun+Fos Heterodimer.primary Jun/Fos Heterodimer 0.387 2.923

JUN TRANSFAC (PWM-based) M00517 AP-1 0.3 2.225

JUN TRANSFAC (PWM-based) M00188 AP-1 0.186 1.196

JUN TRANSFAC (TFBS-based) T00133 c-Jun 0.125 0.842

JUN TRANSFAC (TFBS-based) T00029 AP-1 0.2 0.771

JUN TRANSFAC (PWM-based) M00172 AP-1 0.167 0.542

JUN TRANSFAC (PWM-based) M00199 AP-1 0.247 0.487

JUN TRANSFAC (PWM-based) M00041 CRE-BP1:c-Jun 0.481 0.447

JUN ORegAnno ORegAnno 9606 103 AP1 (Jun family) CRE — Ap1(JUN) — JUN — AP1 — AP-1 0.188 0.447

JUN JASPAR CORE MA0099.2 AP1 0.188 0.408

JUN TRANSFAC (PWM-based) M00173 AP-1 0.169 0.360

JUN TRANSFAC (PWM-based) M00174 AP-1 0.268 0.244

JUNB TRANSFAC (PWM-based) M00517 AP-1 0.26 1.291

JUND TRANSFAC (PWM-based) M00517 AP-1 0.627 1.688

MAX JASPAR CORE MA0058.1 MAX 0.241 1.409

MAX TRANSFAC (PWM-based) M00118 c-Myc:Max 0.508 1.205

MAX JASPAR CORE MA0059.1 MYC::MAX 0.568 0.883

MAX TRANSFAC (PWM-based) M00615 c-Myc:Max 0.384 0.760

MAX TRANSFAC (PWM-based) M00123 c-Myc:Max 0.427 0.626

MAX TRANSFAC (PWM-based) M00119 Max 0.452 0.281

MAZ TRANSFAC (TFBS-based) T00490 MAZ 0.267 0.489

MEF2A TRANSFAC (PWM-based) M00006 MEF-2 0.055 2.325

MEF2A TRANSFAC (PWM-based) M00231 MEF-2 0.159 1.889

MEF2A TRANSFAC (PWM-based) M00026 RSRFC4 0.137 1.705

MEF2A TRANSFAC (PWM-based) M00233 MEF-2 0.057 0.565

MEF2A TRANSFAC (PWM-based) M00232 MEF-2 0.146 0.374

MEF2A JASPAR CORE MA0052.1 MEF2A 0.091 0.324

MEF2A TRANSFAC (TFBS-based) T01005 MEF-2A 0.059 0.266

MYC TRANSFAC (TFBS-based) T00140 c-Myc 0.314 0.602

MYC TRANSFAC (PWM-based) M00123 c-Myc:Max 0.272 0.551

MYC TRANSFAC (PWM-based) M00118 c-Myc:Max 0.302 0.432

MYC TRANSFAC (PWM-based) M00615 c-Myc:Max 0.242 0.352

NFATC1 TRANSFAC (TFBS-based) T01945 NF-AT2 0.023 0.023

NFATC1 TRANSFAC (PWM-based) M00302 NF-AT 0.033 0.012

NFE2 TRANSFAC (PWM-based) M00037 NF-E2 0.095 3.073

NFIC TRANSFAC (PWM-based) M00193 NF-1 0.245 1.204

NFIC JASPAR CORE MA0119.1 TLX1::NFIC 0.345 1.044

NFIC TRANSFAC (PWM-based) M00056 myogenin / NF-1 0.139 0.641

NFIC TRANSFAC (TFBS-based) T00539 NF-1 0.326 0.547

NFIC ORegAnno ORegAnno 9606 34 NFIC (NF-I) — nf1 — NFI — NFIC 0.242 0.440

NFIC TRANSFAC (TFBS-based) T00174 CTF 0.207 0.211

NFIC JASPAR CORE MA0161.1 NFIC 0.077 0.048

NFYA TRANSFAC (TFBS-based) T00150 NF-Y 0.09 9.976

NFYA TRANSFAC (PWM-based) M00287 NF-Y 0.502 6.903

NFYA TRANSFAC (PWM-based) M00185 NF-Y 0.223 2.700

NFYA TRANSFAC (PWM-based) M00209 NF-Y 0.27 1.914

NFYA JASPAR CORE MA0060.1 NFYA 0.428 0.416

NFYB TRANSFAC (TFBS-based) T00150 NF-Y 0.124 13.874

NFYB TRANSFAC (PWM-based) M00287 NF-Y 0.643 4.967

NFYB TRANSFAC (PWM-based) M00209 NF-Y 0.393 3.195

NFYB TRANSFAC M00185 NF-Y 0.315 0.946

(PWM-based) PAX5 TRANSFAC M00144 BSAP 0.077 0.103

(PWM-based) PAX5 TRANSFAC M00143 BSAP 0.041 0.022 162

(PWM-based) POU2F2 TRANSFAC T00647 POU2F2 0.1 0.909

(TFBS-based) POU2F2 TRANSFAC M00210 OCT-x 0.226 0.211

(PWM-based) POU2F2 TRANSFAC T00646 POU2F2 (Oct-2.1) 0.032 0.014

(TFBS-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

RELA TRANSFAC M00052 NF-kappaB (p65) 0.135 0.905

(PWM-based) RELA TRANSFAC T00590 NF-kappaB 0.235 0.627

(TFBS-based) RELA TRANSFAC M00194 NF-kappaB 0.2 0.540 163

(PWM-based) RELA TRANSFAC M00054 NF-kappaB 0.191 0.526

(PWM-based) RELA TRANSFAC M00208 NF-kappaB 0.176 0.500

(PWM-based) RELA JASPAR CORE MA0107.1 RELA 0.195 0.344

Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

RXRA JASPAR CORE MA0065.1 PPARG::RXRA 0.076 0.328

RXRA JASPAR CORE MA0159.1 RXR::RAR DR5 0.07 0.285

RXRA JASPAR CORE MA0115.1 NR1H2::RXRA 0.069 0.254

RXRA JASPAR CORE MA0074.1 RXRA::VDR 0.034 0.058

164 RXRA TRANSFAC T01345 RXR-alpha 0.028 0.033

(TFBS-based) SP1 TRANSFAC T00759 Sp1 0.647 3.648

(TFBS-based) SP1 JASPAR CORE MA0079.2 SP1 0.045 1.716

SP1 ORegAnno ORegAnno 9606 2 Sp1 — SP1 (Sp-1) — 0.172 0.233

ENSG00000185591 — SP-1 Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

SP1 JASPAR CORE MA0079.1 SP1 0.028 0.028

SP1 TRANSFAC M00196 Sp1 0.459 0.018

(PWM-based) SP1 TRANSFAC M00008 Sp1 0.061 0.006

165 (PWM-based) SPI1 TRANSFAC T02068 PU.1 0.421 3.008

(TFBS-based) SPI1 JASPAR CORE MA0080.2 SPI1 0.262 1.038

SPI1 JASPAR CORE MA0080.1 SPI1 0.164 0.407

SREBF1 TRANSFAC M00220 SREBP-1 0.027 0.060

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

SREBF1 TRANSFAC M00221 SREBP-1 0.008 0.005

(PWM-based) SRF JASPAR CORE MA0083.1 SRF 0.247 3.486

SRF TRANSFAC M00152 SRF 0.248 2.705

166 (PWM-based) SRF TRANSFAC T00764 SRF 0.282 2.684

(TFBS-based) SRF TRANSFAC M00186 SRF 0.236 2.481

(PWM-based) SRF TRANSFAC M00215 SRF 0.274 2.053

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

STAT1 ORegAnno ORegAnno 9606 77 STAT1 0.218 3.255

STAT1 TRANSFAC M00223 STATx 0.227 1.220

(PWM-based) STAT1 TRANSFAC M00258 ISRE 0.081 1.182

167 (PWM-based) STAT1 JASPAR CORE MA0137.1 STAT1 0.121 1.101

STAT1 JASPAR CORE MA0137.2 STAT1 0.22 0.421

STAT1 TRANSFAC M00224 STAT1 0.393 0.345

(PWM-based) STAT1 TRANSFAC M00496 STAT1 0.147 0.186

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

STAT2 TRANSFAC M00223 STATx 0.014 0.968

(PWM-based) STAT2 TRANSFAC M00258 ISRE 0.027 0.173

(PWM-based) STAT3 TRANSFAC M00223 STATx 0.304 1.353 168

(PWM-based) STAT3 TRANSFAC M00497 STAT3 0.16 1.170

(PWM-based) STAT3 TRANSFAC M00225 STAT3 0.354 0.274

(PWM-based) STAT5A TRANSFAC M00460 STAT5A (homotetramer) 0.102 0.157

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

STAT5A TRANSFAC M00457 STAT5A (homodimer) 0.111 0.154

(PWM-based) STAT5A TRANSFAC M00499 STAT5A 0.049 0.029

(PWM-based) TAL1 JASPAR CORE MA0091.1 TAL1::TCF3 0.037 0.062 169

TAL1 TRANSFAC M00065 Tal-1beta:E47 0.036 0.055

(PWM-based) TAL1 TRANSFAC M00066 Tal-1alpha:E47 0.033 0.044

(PWM-based) TAL1 TRANSFAC M00070 Tal-1beta:ITF-2 0.046 0.040

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

TBP TRANSFAC T00820 TFIID 0.045 0.244

(TFBS-based) TBP TRANSFAC M00252 TATA 0.156 0.181

(PWM-based) TBP TRANSFAC M00216 TATA 0.063 0.039 170

(PWM-based) TBP TRANSFAC T00794 TBP 0.072 0.030

(TFBS-based) TBP TRANSFAC M00471 TBP 0.159 0.020

(PWM-based) TCF3 TRANSFAC M00065 Tal-1beta:E47 0.023 0.281

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

TCF3 TRANSFAC M00071 E47 0.024 0.140

(PWM-based) TCF3 TRANSFAC T00204 E12 0.098 0.025

(TFBS-based) TCF3 TRANSFAC M00222 Hand1:E47 0.006 0.025 171

(PWM-based) TCF3 JASPAR CORE MA0091.1 TAL1::TCF3 0.03 0.021

TCF3 TRANSFAC M00002 E47 0.07 0.013

(PWM-based) TCF3 TRANSFAC M00066 Tal-1alpha:E47 0.023 0.001

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

ZEB1 TRANSFAC M00414 AREB6 0.046 0.175

(PWM-based) ZEB1 TRANSFAC M00412 AREB6 0.023 0.121

(PWM-based) ZEB1 TRANSFAC T00625 ZEB (1124 AA) 0.075 0.058 172

(TFBS-based) ZEB1 TRANSFAC M00413 AREB6 0.044 0.025

(PWM-based) ZEB1 TRANSFAC M00415 AREB6 0.012 0.004

(PWM-based) NR2F2 TRANSFAC T00045 COUP-TF2 0.107 0.131

(TFBS-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

NR2F2 TRANSFAC M00155 ARP-1 0.053 0.053

(PWM-based) USF1 TRANSFAC M00187 USF 0.756 8.974

(PWM-based) USF1 TRANSFAC T00874 USF1 0.543 5.127 173

(TFBS-based) USF1 TRANSFAC M00217 USF 0.395 2.301

(PWM-based) USF1 JASPAR CORE MA0093.1 USF1 0.395 2.300

USF1 TRANSFAC M00122 USF 0.409 2.300

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

USF1 TRANSFAC M00121 USF 0.345 1.678

(PWM-based) YY1 TRANSFAC M00069 YY1 0.288 0.755

(PWM-based) YY1 TRANSFAC T00915 YY1 0.249 0.566 174

(TFBS-based) YY1 TRANSFAC M00059 YY1 0.15 0.194

(PWM-based) YY1 JASPAR CORE MA0095.1 YY1 0.158 0.189

FOSL1 TRANSFAC M00517 AP-1 0.226 3.039

(PWM-based) Continued on next page Table 2.3.13 – continued from previous page

TF Collection Model ID Model Name AccuracyAve Prec1

CTCF ORegAnno ORegAnno 9606 74 CTCF 0.989 6.907

CTCF JASPAR CORE MA0139.1 CTCF 0.767 4.733

¹ Average Precision (10⁻⁵)

Table 2.3.14: Performance of LASAGNA-Search TF Models Validated by ENCODE Mouse ChIP-seq Data

TF | Collection | Model ID | Model Name | Accuracy | Ave Prec¹
Cebpb | TRANSFAC (TFBS-based) | T00017 | C/EBPbeta(p35) | 0.063 | 4.707
Cebpb | TRANSFAC (PWM-based) | M00117 | C/EBPbeta | 0.151 | 2.453
Cebpb | TRANSFAC (PWM-based) | M00109 | C/EBPbeta | 0.084 | 1.260
Fli1 | UniPROBE | UP00416.Fli1.primary | Fli1 | 0.205 | 7.963
Gata1 | JASPAR CORE | MA0035.2 | Gata1 | 0.159 | 0.566
Gata1 | TRANSFAC (PWM-based) | M00346 | GATA-1 | 0.147 | 0.527
Gata1 | JASPAR CORE | MA0140.1 | Tal1::Gata1 | 0.154 | 0.502
Gata1 | JASPAR CORE | MA0035.1 | Gata1 | 0.067 | 0.477
Gata1 | TRANSFAC (TFBS-based) | T00305 | GATA-1 | 0.162 | 0.441
Gata1 | TRANSFAC (PWM-based) | M00128 | GATA-1 | 0.124 | 0.318
Gata1 | TRANSFAC (PWM-based) | M00203 | GATA-X | 0.171 | 0.123
Gata1 | TRANSFAC (PWM-based) | M00126 | GATA-1 | 0.084 | 0.095
Gata1 | TRANSFAC (PWM-based) | M00127 | GATA-1 | 0.054 | 0.083
Gata1 | TRANSFAC (PWM-based) | M00075 | GATA-1 | 0.075 | 0.050
Jun | TRANSFAC (PWM-based) | M00174 | AP-1 | 0.022 | 1.884
Jun | TRANSFAC (TFBS-based) | T00032 | AP-1 | 0.027 | 0.262
Jun | TRANSFAC (PWM-based) | M00172 | AP-1 | 0.027 | 0.225
Jun | TRANSFAC (PWM-based) | M00188 | AP-1 | 0.028 | 0.120
Jun | TRANSFAC (PWM-based) | M00517 | AP-1 | 0.049 | 0.111
Jun | TRANSFAC (PWM-based) | M00199 | AP-1 | 0.047 | 0.088
Jun | TRANSFAC (PWM-based) | M00041 | CRE-BP1:c-Jun | 0.136 | 0.085
Jun | TRANSFAC (PWM-based) | M00173 | AP-1 | 0.035 | 0.060
Jund | TRANSFAC (PWM-based) | M00517 | AP-1 | 0.053 | 0.283
Mafk | UniPROBE | UP00044.Mafk.primary | Mafk | 0.181 | 9.666
Mafk | TRANSFAC (PWM-based) | M00037 | NF-E2 | 0.449 | 2.187
Mafk | UniPROBE | UP00044.Mafk.secondary | Mafk | 0.023 | 0.959
Mafk | TRANSFAC (TFBS-based) | T00557 | NF-E2 | 0.158 | 0.028
Myb | TRANSFAC (TFBS-based) | T00138 | c-Myb | 0.011 | 0.112
Myb | UniPROBE | UP00092.Myb.secondary | Myb | 0.018 | 0.109
Myb | TRANSFAC (PWM-based) | M00183 | c-Myb | 0.002 | 0.093
Myb | UniPROBE | UP00092.Myb.primary | Myb | 0.023 | 0.023
Myb | TRANSFAC (PWM-based) | M00004 | c-Myb | 0.009 | 0.014
Myb | JASPAR CORE | MA0110.1 | ATHB-5 | 0.002 | 0.001
Myb | JASPAR CORE | MA0100.1 | Myb | 0.009 | 0.001
Myod1 | TRANSFAC (PWM-based) | M00001 | MyoD | 0.096 | 0.675
Myod1 | TRANSFAC (TFBS-based) | T00526 | MyoD | 0.124 | 0.398
Myod1 | TRANSFAC (PWM-based) | M00184 | MyoD | 0.098 | 0.332
Myog | TRANSFAC (TFBS-based) | T00528 | myogenin | 0.159 | 0.859
Pax5 | TRANSFAC (PWM-based) | M00144 | BSAP | 0.001 | 0.066
Pax5 | JASPAR CORE | MA0014.1 | Pax5 | 0.004 | 0.024
Pax5 | TRANSFAC (PWM-based) | M00143 | BSAP | 0 | 0.000
Srf | TRANSFAC (TFBS-based) | T00765 | SRF (504 AA) | 0.076 | 1.888
Srf | TRANSFAC (PWM-based) | M00215 | SRF | 0.078 | 1.817
Srf | TRANSFAC (PWM-based) | M00152 | SRF | 0.077 | 1.348
Srf | UniPROBE | UP00077.Srf.primary | Srf | 0.055 | 1.245
Srf | TRANSFAC (PWM-based) | M00186 | SRF | 0.079 | 0.557
Srf | UniPROBE | UP00077.Srf.secondary | Srf | 0.002 | 0.001
Tbp | UniPROBE | UP00029.Tbp.primary | Tbp | 0.06 | 0.187
Tbp | UniPROBE | UP00029.Tbp.secondary | Tbp | 0.096 | 0.168
Tbp | TRANSFAC (PWM-based) | M00252 | TATA | 0.085 | 0.108
Tbp | TRANSFAC (PWM-based) | M00471 | TBP | 0.13 | 0.058
Tbp | TRANSFAC (PWM-based) | M00216 | TATA | 0.074 | 0.058
Tcf3 | JASPAR CORE | MA0092.1 | Hand1::Tcfe2a | 0.004 | 0.015
Tcf3 | UniPROBE | UP00046.Tcfe2a.secondary | Tcfe2a | 0.007 | 0.010
Tcf3 | UniPROBE | UP00046.Tcfe2a.primary | Tcfe2a | 0.008 | 0.006
Usf1 | TRANSFAC (PWM-based) | M00122 | USF | 0.23 | 16.555
Usf1 | TRANSFAC (PWM-based) | M00217 | USF | 0.182 | 8.406
Usf1 | TRANSFAC (PWM-based) | M00187 | USF | 0.391 | 5.904
Usf1 | TRANSFAC (PWM-based) | M00121 | USF | 0.276 | 3.613
Ets1 | TRANSFAC (TFBS-based) | T00111 | c-Ets-1 | 0.134 | 0.458
Ets1 | UniPROBE | UP00414.Ets1.primary | Ets1 | 0.113 | 0.266
Ets1 | TRANSFAC (PWM-based) | M00032 | c-Ets-1(p54) | 0.131 | 0.260

¹ Average Precision (10⁻⁵)

Table 2.3.15: Performance of MAPPER2 TF Models Validated by ENCODE Human ChIP-seq Data

TF | Collection | Model ID | Model Name | Accuracy | Ave Prec¹
ATF1 | MAPPER | T00968 | ATF-1 | 0.036 | 5.129
ATF1 | TRANSFAC | M00981 | CREB, ATF | 0.204 | 4.308
ATF1 | TRANSFAC | M00017 | ATF | 0.343 | 1.504
ATF1 | TRANSFAC | M00691 | ATF1 | 0.404 | 0.300
ATF3 | TRANSFAC | M00513 | ATF3 | 0.354 | 4.616
ATF3 | MAPPER | T01313 | ATF3 | 0.08 | 2.167
ATF3 | TRANSFAC | M00981 | CREB, ATF | 0.168 | 0.692
ATF3 | TRANSFAC | M00017 | ATF | 0.296 | 0.259
CEBPB | TRANSFAC | M00117 | C/EBPbeta | 0.091 | 2.370
CEBPB | TRANSFAC | M00109 | C/EBPbeta | 0.342 | 0.899
CEBPB | MAPPER | T00581 | C/EBPbeta | 0.267 | 0.556
CEBPB | TRANSFAC | M00912 | C/EBP | 0.599 | 0.223
CREB1 | TRANSFAC | M00917 | CREB | 0.688 | 5.561
CREB1 | TRANSFAC | M00916 | CREB | 0.59 | 3.957
CREB1 | MAPPER | T00163 | CREB | 0.323 | 1.774
CREB1 | TRANSFAC | M00115 | Tax/CREB | 0.144 | 1.139
CREB1 | TRANSFAC | M00114 | Tax/CREB | 0.226 | 0.559
CREB1 | TRANSFAC | M00981 | CREB, ATF | 0.23 | 0.524
CREB1 | TRANSFAC | M00113 | CREB | 0.418 | 0.266
ATF2 | TRANSFAC | M00017 | ATF | 0.325 | 5.336
ATF2 | MAPPER | T00167 | ATF-2-xbb4 | 0.599 | 1.656
ATF2 | TRANSFAC | M00981 | CREB, ATF | 0.198 | 0.583
CUX1 | TRANSFAC | M00106 | CDP CR3+HD | 0.066 | 0.609
CUX1 | TRANSFAC | M00104 | CDP CR1 | 0.174100719 | 0.146
CUX1 | TRANSFAC | M00095 | CDP | 0.038 | 0.036
CUX1 | TRANSFAC | M00105 | CDP CR3 | 0.026 | 0.014
E2F1 | TRANSFAC | M00938 | E2F-1 | 0.066 | 4.711
E2F1 | MAPPER | T05208 | pRb:E2F-1:DP-1 | 0.193 | 3.887
E2F1 | TRANSFAC | M00940 | E2F-1 | 0.446 | 2.875
E2F1 | TRANSFAC | M00920 | E2F | 0.492 | 1.502
E2F1 | TRANSFAC | M00024 | E2F | 0.209 | 1.468
E2F1 | MAPPER | T00221 | E2F:DP | 0.391 | 1.455
E2F1 | MAPPER | T05205 | E2F-1:DP-2 | 0.168 | 1.312
E2F1 | TRANSFAC | M00919 | E2F | 0.283 | 1.140
E2F1 | TRANSFAC | M00740 | Rb:E2F-1:DP-1 | 0.285 | 0.627
E2F1 | TRANSFAC | M00050 | E2F | 0.305 | 0.424
E2F1 | MAPPER | T01542 | E2F-1 | 0.243902439 | 0.051
E2F4 | MAPPER | T05206 | E2F-4:DP-1 | 0.286 | 3.600
E2F4 | TRANSFAC | M00920 | E2F | 0.469 | 2.754
E2F4 | TRANSFAC | M00024 | E2F | 0.254 | 2.099
E2F4 | MAPPER | T05207 | E2F-4:DP-2 | 0.163 | 1.803
E2F4 | MAPPER | T00221 | E2F:DP | 0.425 | 1.352
E2F4 | TRANSFAC | M00739 | E2F-4:DP-2 | 0.395 | 1.324
E2F4 | TRANSFAC | M00738 | E2F-4:DP-1 | 0.297 | 1.318
E2F4 | TRANSFAC | M00050 | E2F | 0.313 | 1.186
E2F4 | TRANSFAC | M00919 | E2F | 0.304 | 0.527
ELK1 | TRANSFAC | M00007 | Elk-1 | 0.222 | 5.686
ELK1 | TRANSFAC | M00025 | Elk-1 | 0.567 | 1.062
ELK1 | MAPPER | T00250 | Elk-1 | 0.017 | 0.005
ELK4 | MAPPER | T00737 | SAP-1a | 0.004 | 0.064
EP300 | TRANSFAC | M00033 | p300 | 0.147 | 0.123
ESR1 | MAPPER | T00261 | ER-alpha | 0.017 | 0.237
ESR1 | TRANSFAC | M00959 | ER | 0.048 | 0.086
ETS1 | MAPPER | T00112 | c-Ets-1 | 0.033 | 0.045
FOXM1 | TRANSFAC | M00791 | HNF3 | 0.029 | 0.015
FOXM1 | MAPPER | T01104 | HNF-3 | 0.017 | 0.004
FOS | TRANSFAC | M00924 | AP-1 | 0.118 | 0.896
FOS | MAPPER | T00029 | AP-1 | 0.29 | 0.180
FOS | MAPPER | T00123 | c-Fos | 0.116 | 0.144
GABPA | MAPPER | T01390 | GABP-alpha | 0.097 | 0.121
GATA3 | MAPPER | T00311 | GATA-3 isoform-1 | 0.06 | 0.040
NR3C1 | MAPPER | T05076 | GR | 0.065 | 0.411
NR3C1 | MAPPER | T00337 | GR-alpha | 0.1 | 0.314
NR3C1 | TRANSFAC | M00960 | PR, GR | 0.001 | 0.000
HNF4A | TRANSFAC | M00762 | PPAR, HNF-4, COUP, RAR | 0.143 | 1.678
HNF4A | TRANSFAC | M00638 | HNF4alpha | 0.14 | 1.535
HNF4A | TRANSFAC | M00158 | COUP-TF, HNF-4 | 0.179 | 1.260
HNF4A | TRANSFAC | M00967 | HNF4, COUP | 0.03 | 1.093
HNF4A | MAPPER | T02758 | HNF-4 | 0.03 | 0.089
HNF4A | TRANSFAC | M00764 | HNF4 direct repeat 1 | 0.186 | 0.043
HSF1 | MAPPER | T00383 | HSF | 0.002 | 0.252
HSF1 | TRANSFAC | M00641 | HSF | 0.014 | 0.034
IRF1 | TRANSFAC | M00772 | IRF | 0.266 | 1.543
IRF1 | TRANSFAC | M00972 | IRF | 0.228 | 0.808
IRF1 | TRANSFAC | M00062 | IRF-1 | 0.237 | 0.693
IRF1 | MAPPER | T00423 | IRF-1 | 0.032 | 0.011
JUN | MAPPER | T00133 | c-Jun | 0.132 | 0.280
JUN | TRANSFAC | M00924 | AP-1 | 0.058 | 0.195
JUN | MAPPER | T00029 | AP-1 | 0.121 | 0.051
JUNB | TRANSFAC | M00924 | AP-1 | 0.05 | 0.065
JUND | TRANSFAC | M00924 | AP-1 | 0.141 | 0.127
MAX | MAPPER | T05056 | Max | 0.36 | 0.506
MEF2A | TRANSFAC | M00407 | RSRFC4 | 0.012 | 0.326
MEF2A | TRANSFAC | M00941 | MEF-2 | 0.014 | 0.324
MEF2A | TRANSFAC | M00026 | RSRFC4 | 0.018 | 0.157
MEF2A | TRANSFAC | M00006 | MEF-2 | 0.053 | 0.138
MEF2A | MAPPER | T01005 | MEF-2A | 0.046 | 0.051
MEF2A | TRANSFAC | M00406 | MEF-2 | 0.049 | 0.050
MEF2A | MAPPER | T01009 | RSRFC4 | 0.005 | 0.032
MEF2A | MAPPER | T01006 | aMEF-2 | 0.122580645 | 0.002
MYC | MAPPER | T00140 | c-Myc | 0.124 | 0.097
NFATC1 | TRANSFAC | M00935 | NF-AT | 0.023 | 0.011
NFE2 | TRANSFAC | M00037 | NF-E2 | 0.143 | 4.432
NFIC | TRANSFAC | M00056 | myogenin / NF-1 | 0.185 | 0.361
NFYA | TRANSFAC | M00687 | alpha-CP1 | 0.166 | 1.502
NFYA | MAPPER | T00150 | NF-Y | 0.176 | 1.416
NFYA | TRANSFAC | M00775 | NF-Y | 0.205 | 1.370
NFYB | TRANSFAC | M00687 | alpha-CP1 | 0.23 | 2.955
NFYB | MAPPER | T00150 | NF-Y | 0.251 | 2.304
NFYB | TRANSFAC | M00775 | NF-Y | 0.312 | 2.132
PAX5 | TRANSFAC | M00143 | Pax-5 | 0.017 | 0.201
PAX5 | TRANSFAC | M00144 | Pax-5 | 0.115 | 0.043
POU2F2 | MAPPER | T00647 | 10/02/2012 | 0.009 | 0.193
POU2F2 | TRANSFAC | M00795 | Octamer | 0.115 | 0.002
RELA | TRANSFAC | M00054 | NF-kappaB | 0.016251354 | 1.035
RELA | TRANSFAC | M00052 | NF-kappaB (p65) | 0.256 | 0.433
RELA | MAPPER | T00590 | NF-kappaB | 0.148 | 0.221
RELA | MAPPER | T00594 | RelA-p65 | 0.128 | 0.005
RXRA | MAPPER | T05325 | LXR-beta:RXR-alpha | 0.007 | 0.291
RXRA | TRANSFAC | M00631 | FXR/RXR-alpha | 0.03 | 0.242
RXRA | MAPPER | T05313 | FXR:RXR-alpha | 0.043 | 0.199
RXRA | TRANSFAC | M00647 | LXR | 0.062 | 0.115
RXRA | TRANSFAC | M00767 | FXR inverted repeat 1 | 0.048 | 0.111
RXRA | MAPPER | T01345 | RXR-alpha | 0.024 | 0.078
RXRA | MAPPER | T05324 | LXR-alpha:RXR-alpha | 0.043 | 0.074
RXRA | TRANSFAC | M00766 | LXR direct repeat 4 | 0.046 | 0.060
RXRA | TRANSFAC | M00963 | T3R | 0.017 | 0.052
RXRA | TRANSFAC | M00518 | PPARalpha:RXRalpha | 0.043 | 0.029
RXRA | TRANSFAC | M00966 | VDR, CAR, PXR | 0.031 | 0.013
RXRA | TRANSFAC | M00964 | PXR, CAR, LXR, FXR | 0.009 | 0.003
SP1 | TRANSFAC | M00933 | Sp1 | 0.392 | 3.309
SP1 | TRANSFAC | M00932 | Sp1 | 0.105 | 2.256
SP1 | TRANSFAC | M00931 | Sp1 | 0.648 | 1.211
SP1 | TRANSFAC | M00008 | Sp1 | 0.535 | 0.087
SPI1 | TRANSFAC | M00658 | PU.1 | 0.185 | 0.518
SREBF1 | MAPPER | T01556 | SREBP-1a | 0.036023055 | 0.175
SREBF1 | TRANSFAC | M00776 | SREBP | 0.011 | 0.009
SRF | TRANSFAC | M00810 | SRF | 0.18 | 1.674
SRF | TRANSFAC | M00922 | SRF | 0.197 | 1.503
SRF | MAPPER | T00764 | SRF | 0.05 | 0.074
STAT1 | MAPPER | T00428 | ISGF-3 | 0.001 | 1.822
STAT1 | MAPPER | T04759 | STAT1 | 0.206 | 1.108
STAT1 | TRANSFAC | M00972 | IRF | 0.071 | 0.111
STAT1 | TRANSFAC | M00223 | STATx | 0.31 | 0.001
STAT2 | MAPPER | T00428 | ISGF-3 | 0.002 | 0.336
STAT2 | TRANSFAC | M00972 | IRF | 0.011 | 0.077
STAT2 | TRANSFAC | M00223 | STATx | 0.024 | 0.033
STAT3 | MAPPER | T01493 | STAT3 | 0.175 | 1.349
STAT3 | TRANSFAC | M00223 | STATx | 0.35 | 0.392
STAT5A | TRANSFAC | M00457 | STAT5A (homodimer) | 0.006 | 0.072
STAT5A | TRANSFAC | M00460 | STAT5A (homotetramer) | 0.074 | 0.001
TBP | MAPPER | T00794 | TBP | 0.036 | 0.009
TCF3 | MAPPER | T00207 | E47 | 0.04 | 0.537
TCF3 | TRANSFAC | M00804 | E2A | 0.011 | 0.506
TCF3 | TRANSFAC | M00693 | E12 | 0.134 | 0.155
TCF3 | TRANSFAC | M00002 | E47 | 0.134 | 0.068
TCF3 | MAPPER | T00204 | E12 | 0.014 | 0.005
TCF3 | TRANSFAC | M00929 | MyoD | 0.076 | 0.004
ZEB1 | MAPPER | T00625 | ZEB (1124 AA) | 0.003 | 0.139
ZEB1 | TRANSFAC | M00414 | AREB6 | 0.014 | 0.014
ZEB1 | TRANSFAC | M00413 | AREB6 | 0.069 | 0.012
ZEB1 | TRANSFAC | M00412 | AREB6 | 0.022 | 0.005
NR2F2 | TRANSFAC | M00765 | COUP direct repeat 1 | 0.284 | 1.258
NR2F2 | TRANSFAC | M00762 | PPAR, HNF-4, COUP, RAR | 0.238 | 0.788
NR2F2 | TRANSFAC | M00155 | ARP-1 (COUP-TF2) | 0.207 | 0.630
NR2F2 | MAPPER | T00045 | COUP-TF2 | 0.124156545 | 0.139
NR2F2 | TRANSFAC | M00967 | HNF4, COUP | 0.063 | 0.041
USF1 | MAPPER | T00874 | USF1 | 0.057 | 6.019
USF1 | TRANSFAC | M00796 | USF | 0.634 | 0.053
YY1 | TRANSFAC | M00059 | YY1 | 0.076 | 0.915
YY1 | TRANSFAC | M00069 | YY1 | 0.328 | 0.128
YY1 | MAPPER | T00915 | YY1 | 0.093 | 0.068
YY1 | TRANSFAC | M00793 | YY1 | 0.13 | 0.057
FOSL1 | TRANSFAC | M00924 | AP-1 | 0.03 | 0.075

¹ Average Precision (10⁻⁵)

Table 2.3.16: Performance of MAPPER2 TF Models Validated by ENCODE Mouse ChIP-seq Data

TF | Collection | Model ID | Model Name | Accuracy | Ave Prec¹
Cebpb | TRANSFAC | M00912 | C/EBP | 0.005 | 4.611
Cebpb | MAPPER | T00017 | C/EBPbeta(p35) | 0.016 | 0.254
Cebpb | TRANSFAC | M00109 | C/EBPbeta | 0.04 | 0.041
Cebpb | TRANSFAC | M00117 | C/EBPbeta | 0.143 | 0.005
Gata1 | MAPPER | T00305 | GATA-1 | 0.043 | 0.037
Jun | MAPPER | T00131 | c-Jun | 0.004 | 0.028
Jun | TRANSFAC | M00924 | AP-1 | 0.016 | 0.019
Jun | MAPPER | T00032 | AP-1 | 0.013 | 0.002
Jund | MAPPER | T00437 | JunD | 0.007782101 | 0.034
Jund | TRANSFAC | M00924 | AP-1 | 0.02 | 0.002
Mafk | TRANSFAC | M00037 | NF-E2 | 0.581 | 14.487
Myb | MAPPER | T00138 | c-Myb | 0.006 | 0.009
Myod1 | TRANSFAC | M00804 | E2A | 0.036 | 0.263
Myod1 | TRANSFAC | M00929 | MyoD | 0.08 | 0.054
Myog | TRANSFAC | M00804 | E2A | 0.036 | 0.274
Myog | TRANSFAC | M00929 | MyoD | 0.096 | 0.265
Myog | TRANSFAC | M00712 | myogenin | 0.103 | 0.041
Pax5 | TRANSFAC | M00144 | Pax-5 | 0.001 | 0.118
Pax5 | TRANSFAC | M00143 | Pax-5 | 0.002 | 0.013
Srf | TRANSFAC | M00922 | SRF | 0.064 | 1.266
Srf | TRANSFAC | M00810 | SRF | 0.065 | 1.078
Srf | MAPPER | T00765 | SRF-L | 0.062 | 0.761
Tcf3 | TRANSFAC | M00804 | E2A | 0.012 | 0.058
Tcf3 | TRANSFAC | M00929 | MyoD | 0.016 | 0.038
Usf1 | TRANSFAC | M00796 | USF | 0.294 | 10.486
Usf1 | MAPPER | T00877 | USF1 | 0.013 | 0.017
Ets1 | TRANSFAC | M00032 | c-Ets-1(p54) | 0.162 | 0.475
Ets1 | MAPPER | T00111 | c-Ets-1 | 0.059 | 0.060

¹ Average Precision (10⁻⁵)

An outlier corresponding to Mafk is seen in Figures 2.3.4 and 2.3.5. Four models in LASAGNA-Search and one MAPPER2 model were used to predict Mafk binding sites (see Tables 2.3.14 and 2.3.16). Interestingly, the best model of each tool is based on the same TRANSFAC matrix, M00037. The LASAGNA-Search model is a PWM, which carries no position dependence information. The MAPPER2 model, however, is an HMM, which does consider position dependence. The use of position dependence gave the MAPPER2 model an edge over the LASAGNA-Search model. The other three LASAGNA-Search models performed much worse than the best one, resulting in poor average performance on Mafk. While it is difficult to draw conclusions on mouse TFs based on only 13 TFs, the results on human TFs indicate that LASAGNA-Search models are significantly better. Overall, we observe that LASAGNA-Search significantly outperforms MAPPER2, indicating that models in LASAGNA-Search more accurately predict TFBSs.
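The per-TF averaging behind the Mafk point can be illustrated with a minimal Python sketch. It uses the average precision values (in units of 10⁻⁵) listed for Mafk in Tables 2.3.14 and 2.3.16; the dictionary and function names are hypothetical and chosen only for this illustration.

```python
from statistics import mean

# Average precision (units of 1e-5) of the Mafk models, copied from
# Table 2.3.14 (LASAGNA-Search) and Table 2.3.16 (MAPPER2).
lasagna_mafk = {
    "UP00044.Mafk.primary": 9.666,
    "M00037": 2.187,
    "UP00044.Mafk.secondary": 0.959,
    "T00557": 0.028,
}
mapper2_mafk = {"M00037": 14.487}


def tf_score(model_scores):
    """Average the scores of all models available for one TF."""
    return mean(model_scores.values())


if __name__ == "__main__":
    print("LASAGNA-Search:", round(tf_score(lasagna_mafk), 3))
    print("MAPPER2:", round(tf_score(mapper2_mafk), 3))
```

Averaging the four LASAGNA-Search models gives about 3.21, against 14.487 for the single MAPPER2 model, which is consistent with Mafk appearing as an outlier in the mouse panel.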

2.4 Future Directions

We plan to improve LASAGNA-Search in two respects: expanding the content and incorporating useful features. More species will be supported in automatic promoter retrieval and visualization in the UCSC Genome Browser. To expand our TF model collections, more sources of TFBSs and PWMs, such as the PAZAR database [64] and ChIP-seq data, will be considered. The GBP score [21] is based on multiple evidence sources, including evolutionary conservation, and has been shown to improve prediction of binding sites. Integration of the GBP scores with the search module will be investigated.

Using a cluster of TF models to scan a sequence for binding sites has been shown to outperform using the best model in the cluster [62]. This strategy will benefit from our large collections of TF models and improve the TFBS search performance of LASAGNA-Search. Finally, we will enable the search for two-block motifs [8, 38], that is, binding sites composed of two half sites separated by variable-length gaps. While plenty of work has been devoted to de novo two-block motif discovery [8, 56, 54], searching for two-block motif instances is straightforward. Using two TF models with or without a gap penalty [8] will be investigated.

[Figure 2.3.4: scatter plots in three panels, (A) Homo sapiens (51 TFs, p-value: 0.00017), (B) Mus musculus (13 TFs, p-value: 0.02, outlier Mafk labeled) and (C) Overall (64 TFs, p-value: 2.9e-05). x-axis: LASAGNA-Search; y-axis: MAPPER2.]

Figure 2.3.4: Comparison of Precomputed TF Models by Average Precision

Plots of performance of MAPPER2 against that of LASAGNA-Search for (A) human TFs, (B) mouse TFs and (C) all the TFs, respectively. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. Each model is scored by average precision and the average score across the models is used to plot the point. The outlier, Mafk, in (B) is marked in red. The number of TFs and the p-value of a Wilcoxon signed-rank test are shown for each plot.

[Figure 2.3.5: scatter plots in three panels, (A) Homo sapiens (51 TFs, p-value: 7e-06), (B) Mus musculus (13 TFs, p-value: 0.034, outlier Mafk labeled) and (C) Overall (64 TFs, p-value: 1.4e-06). x-axis: LASAGNA-Search; y-axis: MAPPER2.]

Figure 2.3.5: Comparison of Precomputed TF Models by Accuracy

Plots of performance of MAPPER2 against that of LASAGNA-Search for (A) human TFs, (B) mouse TFs and (C) all the TFs, respectively. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. Each model is scored by accuracy and the average score across the models is used to plot the point. The outlier, Mafk, in (B) is marked in red. The number of TFs and the p-value of a Wilcoxon signed-rank test are shown for each plot.

Chapter 3 Searching for Transcription Factor Binding Sites in Vector Spaces

3.1 Background

In this chapter, we describe searching for transcription factor binding sites in vector spaces. Specifically, l-mers are placed in Euclidean space such that each l-mer corresponds to a vector in the space. With known binding sites of a TF, we construct a profile vector for the TF. This profile vector can then be used as a query vector to search for unknown binding sites in the space, given a similarity measure between two vectors.

The vector space model has long been used in information retrieval (IR) [70, 53]. Under this model, each document in a collection is embedded in a t-dimensional space. That is, each document is represented by a t-element vector, where t is the number of distinct terms present in the document collection, or corpus. To search for documents on a particular topic, a query composed of terms relevant to the topic is constructed. The query can be similarly embedded in the t-dimensional space. Similarity between the query and a document can then be measured by the similarity between the two corresponding vectors. In the TFBS search problem, the entire genome or the collection of promoter region sequences corresponds to the corpus, whereas an l-mer is analogous to a document in IR. Likewise, a TF is analogous to a topic, while a TF representation is the analog of a query for the topic.

In this framework, we propose two novel approaches to constructing a query vector for a TF of interest: the negative-to-positive vector (NPV) and the optimal discriminating vector (ODV) methods. We compare the proposed methods to a state-of-the-art method, the ULPB method, as well as to the widely used PSSM method. The performance of each method is assessed by cross-validation experiments on two data sets collected from RegulonDB [27] and JASPAR [65], respectively. Based on the NPV and ODV methods, we investigate motif subtype identification and show that, consistent with previous studies [35, 28], identification of motif subtypes improves TFBS search performance. Independent validation on human ChIP-seq data gives further insights into the proposed methods. Finally, we discuss the advantages of searching for TF binding sites in the proposed framework.

3.2 Methods

3.2.1 Data Sets

To understand the compared methods in this chapter, we experimented on prokaryotic as well as eukaryotic transcription factors. The known prokaryotic TF binding sites were collected from RegulonDB [27] release 6.8. This data source, also considered in [69], contains binding sites of TFs in the E. coli K-12 genome. We considered a data set of 26 TFs with 17 or more known binding sites. This filtering criterion ensures that, for each TF, we have enough examples to learn from. Similar filtering criteria were used in [69]. This data set is summarized in Table 3.2.17.

Table 3.2.17: Statistics of the E. coli TFs in RegulonDB

Name | Length | # TFBSs | Name | Length | # TFBSs
MetJ | 8 | 29 | Lrp | 12 | 62
SoxS | 18 | 19 | H-NS | 15 | 37
FlhDC | 16 | 20 | AraC | 18 | 20
Fis | 15 | 206 | ArcA | 15 | 93
IHF | 13 | 101 | OmpR | 20 | 22
PhoB | 20 | 17 | GlpR | 20 | 23
OxyR | 17 | 41 | CpxR | 15 | 37
NarL | 7 | 90 | CRP | 22 | 249
TyrR | 18 | 19 | NarP | 7 | 20
Fur | 19 | 81 | LexA | 20 | 40
NtrC | 17 | 17 | FNR | 14 | 87
MalT | 10 | 20 | PhoP | 17 | 21
ArgR | 18 | 32 | NsrR | 11 | 37

The known eukaryotic TF binding sites were collected from the JASPAR CORE database (4th release) [65]. TFs of Homo sapiens and Mus musculus were filtered by two criteria. A TF was kept only if it has at least 20 known binding sites and the length of its binding sites is at least 6 nucleotides. The length criterion, arbitrarily chosen, ensures that a TF under consideration is specific enough. This data set is summarized in Table 3.2.18.

3.2.2 Notation

For clarity, we list and define the functions and variables used throughout this chapter.

• $f_i(u)$ denotes the probability of observing letter $u$ at position $i$ of a TFBS, where $u \in \{A, C, G, T\}$.

• $f_{i,j}(u, v)$ denotes the probability of observing letters $u$ and $v$ at positions $i$ and $j$, respectively, where $i < j$ and $u, v \in \{A, C, G, T\}$.

• $f_i(v \mid u)$ denotes the position-specific conditional probability of observing $v$ at position $i + 1$ given $u$ has been seen at position $i$, where $u, v \in \{A, C, G, T\}$.

• $f(v \mid u)$ denotes the background conditional probability of observing $v$ given $u$ has been observed at the previous position, where $u, v \in \{A, C, G, T\}$.

• $I_u(\cdot)$ is the indicator function given by

\[ I_u(v) = \begin{cases} 1 & \text{if } v = u, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2.1} \]

where $u, v \in \{A, C, G, T\}$.

• $I_{u_1 u_2}(\cdot)$ is similarly defined as follows:

\[ I_{u_1 u_2}(v_1 v_2) = \begin{cases} 1 & \text{if } v_1 = u_1 \text{ and } v_2 = u_2, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2.2} \]

where $u_1, u_2, v_1, v_2 \in \{A, C, G, T\}$.

• $\mathrm{IC}_i$ denotes the information content at position $i$ of a binding site. Information content is closely related to entropy, a measure of uncertainty in information theory. The entropy at position $i$ is given by $E_i = -\sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u)$. When $f_i(u) = \frac{1}{4}$ for all $u \in \{A, C, G, T\}$, $E_i$ attains the maximal entropy of 2 and we are most uncertain about the letter at position $i$. $\mathrm{IC}_i$ is simply defined as

\[ \mathrm{IC}_i = 2 - E_i = 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u). \tag{3.2.3} \]

• $\mathrm{IC}_{i,j}$ denotes the information content of the position pair $(i, j)$ of a binding site. Similarly,

\[ \mathrm{IC}_{i,j} = 4 + \sum_{u, v \in \{A, C, G, T\}} f_{i,j}(u, v) \log_2 f_{i,j}(u, v), \tag{3.2.4} \]

where the maximal entropy of 4 is attained when $f_{i,j}(u, v) = \frac{1}{16}$ for all $u, v \in \{A, C, G, T\}$.
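As an illustration of Equation (3.2.3), the following Python sketch estimates the position-specific frequencies from a set of aligned binding sites and computes the information content of each position. It is a minimal sketch under stated assumptions: frequencies are plain relative counts with no pseudocounts, the convention 0 · log₂ 0 = 0 is used, and the function names and toy sites are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(sites):
    """Estimate f_i(u) as relative counts over aligned, equal-length sites.

    Assumption: no pseudocounts are added.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({u: counts[u] / len(sites) for u in ALPHABET})
    return freqs


def information_content(freq):
    """IC_i = 2 + sum_u f_i(u) * log2 f_i(u), treating 0 * log2(0) as 0."""
    return 2.0 + sum(p * math.log2(p) for p in freq.values() if p > 0)


if __name__ == "__main__":
    # Toy sites: position 1 is fully conserved, position 2 is uniform.
    toy_sites = ["ACGT", "AAGT", "ATGT", "AGGT"]
    for i, freq in enumerate(position_frequencies(toy_sites), start=1):
        print(i, information_content(freq))
```

In the toy input, a fully conserved position reaches the maximal information content of 2, while a position where all four letters are equally frequent has information content 0.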

3.2.3 Embedding Short Sequences in Vector Spaces

We describe how a short sequence of l nucleotides or an l-mer is placed in

211 AGTG……CTCT

1000001000010010……0100000101000001

Figure 3.2.1: Illustration of Embedding a Short Sequence in Vector Space

Each nucleotide in the sequence is converted to 4 indicator variables.

th a vector space. Let s be an l-mer and si denote its i nucleotide. Each nucleotide in s is converted to 4 variables, that is, si is converted to wiIA(si), wiIC(si), wiIG(si) and wiIT(si) for i = 1, 2,..., l. Hence, s is converted to 4l variables, placing s in R4l.

Figure 3.2.1 illustrates the conversion of each nucleotide in an l-mer to 4 variables when wi = 1 for i = 1, 2,..., l.

We further consider nucleotide pair $(s_i, s_j)$, where $i < j$. Only pairs in close proximity are considered in this study. We consider $(s_i, s_j)$ only if $j - i = 1$ or 2, i.e., a pair of nucleotides is considered only if they are adjacent or separated by one nucleotide. Nucleotide pair $(s_i, s_j)$ is similarly converted to 16 variables, $w_{i,j} I_{AA}(s_i s_j)$, $w_{i,j} I_{AC}(s_i s_j)$, $\ldots$, $w_{i,j} I_{TT}(s_i s_j)$, as there are 16 possible nucleotide pairs, $\{AA, AC, \ldots, TT\}$. We use $32l - 48$ additional variables to encode the pairs since there are $l - 1$ adjacent pairs and $l - 2$ pairs separated by one nucleotide, giving $16(l - 1) + 16(l - 2) = 32l - 48$. Consequently, considering individual nucleotides and nucleotide pairs, each l-mer is converted to a $(36l - 48)$-element vector.

In this study, we consider two choices of $w_i$'s and $w_{i,j}$'s. For the first choice, all the nucleotides and nucleotide pairs are given the same weight, i.e., $w_i = 1$ and $w_{i,j} = 1$ for all $i$ and $j$. The second one assigns weight to the $i$th nucleotide according to the information content at position $i$. Similarly, it assigns weight to pair $(i, j)$ according to the information content at this pair of positions. Specifically, $w_i = \sqrt{IC_i}$ and $w_{i,j} = \sqrt{IC_{i,j}}$ for all $i$ and $j$.
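The full $(36l - 48)$-element embedding with either weight choice might be sketched as follows. The element ordering (single nucleotides, then adjacent pairs, then pairs separated by one nucleotide), the function name `embed` and the example information content values are assumptions made for illustration; the text fixes only that the single-nucleotide variables occupy the first $4l$ elements.

```python
import math
from itertools import product

ALPHABET = "ACGT"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, ..., TT

def embed(seq, w_single=None, w_pair=None):
    """Embed an l-mer (l >= 2) as a list of 36l - 48 numbers.

    Assumed layout: 4l single-nucleotide variables, then 16(l-1) adjacent-pair
    variables, then 16(l-2) variables for pairs separated by one nucleotide.
    Positions are 0-based; any weight that is not supplied defaults to 1.
    """
    l = len(seq)
    w_single = w_single or [1.0] * l
    w_pair = w_pair or {}
    vec = []
    for i in range(l):
        vec.extend(w_single[i] if seq[i] == u else 0.0 for u in ALPHABET)
    for gap in (1, 2):
        for i in range(l - gap):
            w = w_pair.get((i, i + gap), 1.0)
            vec.extend(w if seq[i] + seq[i + gap] == p else 0.0 for p in PAIRS)
    return vec

unit = embed("ACGTAC")  # first choice: all weights equal to 1

# Second choice: weights are square roots of (hypothetical) information contents.
ic_single = [1.8, 0.4, 1.2]
ic_pair = {(0, 1): 2.5, (1, 2): 1.0, (0, 2): 3.1}
weighted = embed("ACG",
                 w_single=[math.sqrt(v) for v in ic_single],
                 w_pair={k: math.sqrt(v) for k, v in ic_pair.items()})
```

With unit weights, exactly $l + (l - 1) + (l - 2)$ elements are non-zero, one per nucleotide and one per considered pair.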

3.2.4 Searching for TFBSs in Vector Spaces

Given a query vector $\mathbf{t}$ in the vector space, we score an l-mer $s$ as follows:

$$\mathrm{Score}(s) = \mathbf{s}^T \mathbf{t}, \tag{3.2.1}$$

where $\mathbf{s}$ denotes the vector corresponding to $s$. In other words, the score of $s$ is obtained by taking the dot product between $\mathbf{s}$ and $\mathbf{t}$. It can be seen that $\mathrm{Score}(s)$ measures the similarity between $\mathbf{s}$ and $\mathbf{t}$. Assuming that $\mathbf{t}$ corresponds to an l-mer $t$, $\mathrm{Score}(s)$ counts the number of nucleotides and nucleotide pairs shared between $s$ and $t$ when $w_i = 1$ and $w_{i,j} = 1$ for all $i$ and $j$. However, we note that $\mathbf{t}$ can be any vector in the space and does not necessarily correspond to an l-mer.
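The counting interpretation of the score can be checked with a small sketch. The unit-weight embedding below uses an assumed element ordering, and the two l-mers are made up for illustration; the dot product of their embeddings equals the number of positions and position pairs at which they agree.

```python
from itertools import product

ALPHABET = "ACGT"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]

def embed(seq):
    """Unit-weight embedding: singles, then adjacent pairs, then gap-one pairs."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == u else 0.0 for u in ALPHABET)
    for gap in (1, 2):
        for i in range(len(seq) - gap):
            pair = seq[i] + seq[i + gap]
            vec.extend(1.0 if pair == p else 0.0 for p in PAIRS)
    return vec

def score(s_vec, t_vec):
    """Score(s) = s^T t, the dot product of the two vectors."""
    return sum(a * b for a, b in zip(s_vec, t_vec))

def shared_features(s, t):
    """Directly count nucleotides and considered pairs shared by s and t."""
    singles = sum(a == b for a, b in zip(s, t))
    pairs = sum(
        s[i] == t[i] and s[i + gap] == t[i + gap]
        for gap in (1, 2)
        for i in range(len(s) - gap)
    )
    return singles + pairs

s, t = "ACGTAC", "ACGTTC"
dot_score = score(embed(s), embed(t))  # 5 nucleotides + 3 adjacent + 3 gap-one pairs
```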

As described above, an l-mer is converted to a $(36l - 48)$-element vector. Hence, we use $\mathbf{t}$ to search for binding sites in $\mathbb{R}^{36l-48}$. Our approach offers great flexibility in that it easily allows searching for binding sites in a lower dimensional subspace. By setting all but the first $4l$ elements in $\mathbf{t}$ to zero, we are essentially searching for binding sites in $\mathbb{R}^{4l}$. In this chapter, we exploit this advantage and simultaneously search for transcription factor binding sites in three subspaces. Two of them are $\mathbb{R}^{4l}$ and $\mathbb{R}^{36l-48}$. The third one is $\mathbb{R}^{16l-12}$. This subspace is obtained from considering only the first nucleotide and the $l - 1$ adjacent nucleotide pairs, as in a first-order Markov chain.

3.2.5 The NPV Method

We first introduce a simple approach to constructing a query vector. Let $P$ be the set of $n_+$ binding sites and $N$ be the set of $n_-$ non-binding sites of a particular transcription factor. We embed all the l-mers in $P$ and $N$ in $\mathbb{R}^{36l-48}$. We then find the mean binding site vector

$$\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{s \in P} \mathbf{s}$$

as well as the mean non-binding site vector

$$\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{s \in N} \mathbf{s}.$$

The query vector $\mathbf{t}$ is found by subtracting $\boldsymbol{\mu}_-$ from $\boldsymbol{\mu}_+$, that is, $\mathbf{t} = \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$. The query vector $\mathbf{t}$ can be seen as the vector pointing from the center of the non-binding site vectors to the center of the binding site vectors. Hence, we call it the negative-to-positive vector (NPV) method. Figure 3.2.2 illustrates the idea.
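A minimal sketch of the NPV construction is given below. For brevity it embeds l-mers in the $\mathbb{R}^{4l}$ subspace only (single nucleotides with unit weights); the function names and the toy binding and non-binding sites are assumptions for illustration.

```python
ALPHABET = "ACGT"

def embed(seq):
    """Unit-weight single-nucleotide embedding in R^{4l}."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == u else 0.0 for u in ALPHABET)
    return vec

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def npv_query(positives, negatives):
    """t = mu_plus - mu_minus, pointing from the negative to the positive center."""
    mu_pos = mean_vector([embed(s) for s in positives])
    mu_neg = mean_vector([embed(s) for s in negatives])
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def npv_score(seq, t):
    """Score(s) = s^T t."""
    return sum(a * b for a, b in zip(embed(seq), t))

# Hypothetical binding sites (P) and non-binding sites (N).
t = npv_query(["ACGT", "ACGA"], ["TTTT", "GGGG"])
```

Here `npv_score("ACGT", t)` evaluates to 2.5 and `npv_score("TTTT", t)` to -1.5, so the known site outranks the non-site.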

[Figure 3.2.2: scatter plot of TFBS and non-TFBS vectors together with their means $\boldsymbol{\mu}_+$ and $\boldsymbol{\mu}_-$; horizontal axis: Axis 1, vertical axis: Axis 2.]

Figure 3.2.2: Illustration of the NPV Method

The solid arrow represents the negative-to-positive vector $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$, pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_+$. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The center of the binding site vectors is marked by the solid triangle, while the center of the non-binding site vectors is marked by the solid circle.

The score of an l-mer $s$ given by the NPV method is therefore

$$\mathrm{Score}(s) = \mathbf{s}^T (\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-) = \mathbf{s}^T \boldsymbol{\mu}_+ - \mathbf{s}^T \boldsymbol{\mu}_-. \tag{3.2.1}$$

We can see that it computes the similarity between $\mathbf{s}$ and the mean binding site vector as well as the similarity between $\mathbf{s}$ and the mean non-binding site vector. It then scores $s$ by the difference of the two similarity scores. The more similar $\mathbf{s}$ is to the mean binding site vector, the higher the score. The less similar $\mathbf{s}$ is to the mean non-binding site vector, the higher the score.

From the perspective of geometry, we note that $\mathrm{Score}(s)$ in (3.2.1) is proportional to $\mathrm{Score}(s)/\|\mathbf{t}\|$, where $\|\mathbf{t}\|$ is the length of the query vector $\mathbf{t}$. Moreover, by virtue of the equality

$$\mathbf{s}^T \mathbf{t} = \|\mathbf{s}\| \, \|\mathbf{t}\| \cos\theta,$$

we know $\mathrm{Score}(s)/\|\mathbf{t}\| = \|\mathbf{s}\| \cos\theta$ equals the signed length of the orthogonal projection of $\mathbf{s}$ onto $\mathbf{t}$, where $\theta$ is the angle formed by vectors $\mathbf{s}$ and $\mathbf{t}$ (see Figure 3.2.3 for an illustration). Since $\|\mathbf{t}\|$ is the same positive constant for every l-mer, ranking l-mers by $\mathrm{Score}(s)$ is equivalent to ranking them by their orthogonal projections onto $\mathbf{t}$. Similarly, the computation of $\mathrm{Score}(s)$ in (3.2.1) for the NPV method is equivalent, up to the same kind of constant factor, to computation of the orthogonal projection of $\mathbf{s}$ onto $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$. In Figure 3.2.2, we observe that vector $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$ is pointing to the left and, projected onto this vector, most of the binding sites are on the left of the non-binding sites. This implies that most of the binding sites have a higher score than the non-binding sites.

3.2.6 The ODV Method

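We have described the NPV method, which offers a heuristic way of constructing a query vector. We now introduce a way of finding an optimal query vector $\boldsymbol{\beta} \in \mathbb{R}^{36l-48}$. Suppose that $|P| = n_+$ and $|N| = n_-$, that is, there are $n_+$ binding sites and $n_-$ non-binding sites for a particular TF. Let $P = \{s_{(1)}, s_{(2)}, \ldots, s_{(n_+)}\}$ and $N = \{s_{(n_+ + 1)}, s_{(n_+ + 2)}, \ldots, s_{(n)}\}$, where $s_{(i)}$ denotes the $i$th l-mer in the union of the two sets and $n = n_+ + n_-$. We find the optimal $\boldsymbol{\beta}$ by solving the following minimization problem:

$$\min_{\boldsymbol{\beta}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\boldsymbol{\beta}\|^2 + \frac{C}{n_+} \sum_{i=1}^{n_+} \xi_i + \frac{C}{n_-} \sum_{i=n_+ + 1}^{n} \xi_i \tag{3.2.1}$$

$$\text{subject to} \quad \frac{\mathrm{Score}(s_{(i)})}{\|\boldsymbol{\beta}\|} \ge \frac{b + 1 - \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } s_{(i)} \in P, \tag{3.2.2}$$

$$\frac{\mathrm{Score}(s_{(i)})}{\|\boldsymbol{\beta}\|} \le \frac{b - 1 + \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } s_{(i)} \in N, \tag{3.2.3}$$

$$\xi_i \ge 0 \quad \forall i. \tag{3.2.4}$$

[Figure 3.2.3: vectors $\mathbf{s}$ and $\mathbf{t}$, with $\mathbf{s}$ projected orthogonally onto $\mathbf{t}$.]

Figure 3.2.3: The Orthogonal Projection of s onto t

It can be seen that the projection of $\mathbf{s}$ onto $\mathbf{t}$ is equal to $\mathrm{Score}(s)/\|\mathbf{t}\| \propto \mathrm{Score}(s)$.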

The constraint in (3.2.2) ensures that the projection of a TFBS $s_{(i)}$ onto the vector $\boldsymbol{\beta}$, $\mathrm{Score}(s_{(i)})/\|\boldsymbol{\beta}\|$, exceeds the threshold $(b + 1)/\|\boldsymbol{\beta}\|$. On the other hand, the constraint in (3.2.3) ensures that the projection of a non-TFBS $s_{(i)}$ onto $\boldsymbol{\beta}$ stays below the threshold $(b - 1)/\|\boldsymbol{\beta}\|$. Flexibility is given to the thresholds by introducing $\xi_i$'s with cost captured by the last two terms in (3.2.1). Finally, to clearly distinguish TFBSs from non-TFBSs, the squared difference between the two thresholds ($(b + 1)/\|\boldsymbol{\beta}\|$ and $(b - 1)/\|\boldsymbol{\beta}\|$) is made as large as possible. This amounts to maximizing $4/\|\boldsymbol{\beta}\|^2$ or, equivalently, minimizing $\frac{1}{2}\|\boldsymbol{\beta}\|^2$, which is the first term in (3.2.1). We call this approach the optimal discriminating vector (ODV) method.

The optimization problem in (3.2.1) is known as a quadratic programming problem with linear inequality constraints specified in (3.2.2), (3.2.3) and (3.2.4); multiplying both sides of (3.2.2) and (3.2.3) by $\|\boldsymbol{\beta}\| > 0$ makes their linearity in $\boldsymbol{\beta}$, $b$ and $\boldsymbol{\xi}$ explicit. There are $p + n + 1$ variables and $2n$ constraints, where $p = 36l - 48$ is the dimension of $\boldsymbol{\beta}$. We can see that (3.2.2) and (3.2.3) specify $n$ constraints whereas (3.2.4) imposes $n$ constraints on the variables. Quadratic programming [7] is well-studied and hence general solvers are available, e.g., the OpenOpt framework [43]. To solve this problem, the parameter $C \, (> 0)$ is first arbitrarily chosen. A solver then searches for values of $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$, $b$ and $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_n)^T$ such that the objective function in (3.2.1) is minimized while the constraints in (3.2.2), (3.2.3) and (3.2.4) are satisfied simultaneously. It can be seen that an optimal solution to (3.2.1) always exists since the search space of $\{\boldsymbol{\beta}, b, \boldsymbol{\xi}\}$ is never empty and the objective function is bounded below by zero. To find a feasible solution, one can arbitrarily pick $\boldsymbol{\beta} \neq \mathbf{0} \in \mathbb{R}^p$ and $b \in \mathbb{R}$. For $s_{(i)} \in P$, one can pick a sufficiently large $\xi_i \ge 0$ such that the constraint in (3.2.2) is satisfied. Similarly, for $s_{(i)} \in N$, one can pick a sufficiently large $\xi_i \ge 0$ such that the constraint in (3.2.3) is met. We can then compute the value of the objective function as the values of all the variables are known. One way to choose the parameter $C$ in (3.2.1) is to search for $C$ in a range by cross-validation. The parameter is TF-dependent in general, but experiments showed that a small $C = 2^{-6}$ will usually suffice and hence we set $C = 2^{-6}$ for all the ODV experiments in this study.
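The feasibility argument above can be made concrete with a short sketch. Given any non-zero $\boldsymbol{\beta}$ and any $b$, the code below picks the smallest slacks satisfying (3.2.2) to (3.2.4) and evaluates the objective (3.2.1). It does not solve the quadratic program; the function name `odv_objective` and the two-dimensional toy vectors are assumptions for illustration only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def odv_objective(beta, b, positives, negatives, C=2 ** -6):
    """Objective (3.2.1) at a feasible point built from (beta, b).

    Multiplying (3.2.2) and (3.2.3) by ||beta|| > 0 gives
        s.beta >= b + 1 - xi   for TFBS vectors,
        s.beta <= b - 1 + xi   for non-TFBS vectors,
    so the smallest non-negative slacks are hinge-type quantities.
    """
    xi_pos = [max(0.0, b + 1.0 - dot(s, beta)) for s in positives]
    xi_neg = [max(0.0, dot(s, beta) - b + 1.0) for s in negatives]
    value = (0.5 * dot(beta, beta)
             + C / len(positives) * sum(xi_pos)
             + C / len(negatives) * sum(xi_neg))
    return value, xi_pos, xi_neg

# Hypothetical embedded TFBSs and non-TFBSs in a 2-dimensional space.
P = [[2.0, 0.0], [1.0, 1.0]]
N = [[0.0, 1.0], [-1.0, 0.0]]
value, xi_pos, xi_neg = odv_objective([1.0, 0.0], 0.5, P, N)
```

A solver would search over all such feasible points for the one with the smallest objective value; this sketch only evaluates a single candidate.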

3.2.7 The PSSM and ULPB Methods

We briefly describe the ungapped likelihood under positional background (ULPB) method proposed in [69] and the position-specific scoring matrix (PSSM) method compared therein. We refer readers to Section 3.2.2 for functions and variables used here. Consider a specific TF with binding sites of length $l$. The PSSM method scores an l-mer $s$ by

$$\sum_{i=1}^{l} \log f_i(s_i), \tag{3.2.1}$$

where $s_i$ denotes the $i$th letter in $s$. We note that usually the ratio $f_i(s_i)/f(s_i)$ is used in place of $f_i(s_i)$, where $f(s_i)$ is the background probability of $s_i$. The simpler form in (3.2.1) was compared in [69] and hence it serves as a baseline method in this study.
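The baseline PSSM score in (3.2.1) can be sketched as follows. The pseudocount used to avoid taking the logarithm of zero, the natural logarithm, the function names and the toy sites are all assumptions for illustration; the text does not specify how zero frequencies are handled.

```python
import math

ALPHABET = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Estimate f_i(u) at every position from aligned sites of equal length."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {u: pseudocount for u in ALPHABET}
        for site in sites:
            counts[site[i]] += 1.0
        total = sum(counts.values())
        freqs.append({u: c / total for u, c in counts.items()})
    return freqs

def pssm_score(seq, freqs):
    """Sum over positions of log f_i(s_i)."""
    return sum(math.log(freqs[i][base]) for i, base in enumerate(seq))

# Hypothetical aligned binding sites.
freqs = position_frequencies(["ACG", "ACG", "ATG", "ACG"])
consensus_score = pssm_score("ACG", freqs)
variant_score = pssm_score("ATG", freqs)
```

The consensus l-mer receives the higher score because each of its letters is the most frequent one at its position.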

The ULPB method models a TFBS by a first-order Markov chain and models the background by another first-order Markov chain. The background transition probabilities are estimated using the entire genome of a species and hence the ULPB method uses negative examples implicitly. It scores an l-mer $s$ by

$$\log f_1(s_1) + \sum_{i=1}^{l-1} \log\left( \frac{f_i(s_{i+1} \mid s_i)}{f(s_{i+1} \mid s_i)} \right). \tag{3.2.2}$$

Although ULPB does not consider background probability in the first term of (3.2.2), the score is approximately the log-likelihood ratio of the two Markov chains.
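A rough sketch of the score in (3.2.2) is shown below. It is not the implementation of [69]: the pseudocount smoothing, the natural logarithm, the function names and the short stand-in background string (used here in place of a whole genome) are assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def transition_probs(pairs, pseudocount=1.0):
    """Estimate P(v | u) from an iterable of (u, v) nucleotide pairs."""
    counts = Counter(pairs)
    probs = {}
    for u in ALPHABET:
        total = sum(counts[(u, v)] + pseudocount for v in ALPHABET)
        for v in ALPHABET:
            probs[(u, v)] = (counts[(u, v)] + pseudocount) / total
    return probs

def train_ulpb(sites, background, pseudocount=1.0):
    """Position-specific site chain and a single background chain."""
    first = Counter(site[0] for site in sites)
    n = len(sites) + 4 * pseudocount
    f1 = {u: (first[u] + pseudocount) / n for u in ALPHABET}
    site_trans = [
        transition_probs([(s[i], s[i + 1]) for s in sites], pseudocount)
        for i in range(len(sites[0]) - 1)
    ]
    bg_trans = transition_probs(zip(background, background[1:]), pseudocount)
    return f1, site_trans, bg_trans

def ulpb_score(seq, f1, site_trans, bg_trans):
    """log f_1(s_1) + sum_i log( f_i(s_{i+1}|s_i) / f(s_{i+1}|s_i) )."""
    score = math.log(f1[seq[0]])
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        score += math.log(site_trans[i][pair] / bg_trans[pair])
    return score

# Hypothetical sites and a tiny stand-in for genomic background sequence.
model = train_ulpb(["ACG", "ACG", "ACG"], "ACGTACGTTTGGCCAA")
```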

The main difference between the PSSM method and the NPV, ODV and ULPB methods is that the PSSM method neither scores nucleotide pairs nor utilizes a background distribution. The NPV and ODV methods explicitly take advantage of negative examples (non-binding sites), while the ULPB method does so implicitly by using a background distribution. The flexibility of the proposed framework allows the NPV and ODV methods to easily search in subspaces, further distinguishing the PSSM and ULPB methods from the proposed ones.

3.3 Results and Discussion

3.3.1 Performance Assessment and Evaluation Metrics

The performance of a TFBS search method is evaluated by $\nu$-fold cross-validation (CV). Consider a TF with $n_+$ TFBSs of length $l$ with flanking regions on both sides. A set of negative examples, $N_{test}$, called the test negatives, is constructed from the TFBSs of the other TFs with filtering as in [63]. Another set of negative examples, $N_{train}$, called the training negatives, is collected from sequences embedding the $n_+$ binding sites. It comprises all the l-mers except for the TFBSs and two neighboring l-mers of each TFBS.

The $n_+$ TFBSs are first divided into $\nu$ sets, each of which contains $\lfloor n_+/\nu \rfloor$ or $\lfloor n_+/\nu \rfloor + 1$ TFBSs. At each iteration of $\nu$-fold CV, one of the $\nu$ TFBS sets, called the test TFBS set $P_{test}$, is left out. The rest of the TFBSs are therefore called the training TFBSs. A scoring function is obtained using the training TFBSs and non-TFBSs randomly sampled from the training negatives, where the ratio of numbers of non-TFBSs to TFBSs is set to 10. The test TFBSs in $P_{test}$ along with the non-TFBSs in $N_{test}$ are then scored by the scoring function.

To score a test sequence, both the forward and reverse strands are scored and, in case the test sequence is longer or shorter than $l$, the l-mer producing the highest score is used. For each test TFBS $t \in P_{test}$, we find its rank relative to all the non-TFBSs in $N_{test}$. Formally, the rank of $t$ equals

$$1 + |\{ s \in N_{test} \mid \mathrm{Score}(s) \ge \mathrm{Score}(t) \}|.$$
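The two steps above, scanning both strands for the best-scoring l-mer and ranking a test TFBS against the test negatives, can be sketched as follows. The function names and the toy scoring function are assumptions for illustration, and the sketch covers only test sequences at least $l$ bases long.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def best_window_score(seq, l, score):
    """Highest score over all l-mers on both strands (requires len(seq) >= l)."""
    candidates = []
    for strand in (seq, reverse_complement(seq)):
        candidates.extend(strand[i:i + l] for i in range(len(strand) - l + 1))
    return max(score(c) for c in candidates)

def rank(tfbs_score, negative_scores):
    """1 + number of test negatives scoring at least as high as the TFBS."""
    return 1 + sum(1 for s in negative_scores if s >= tfbs_score)

# Toy scoring function: number of positions matching the 3-mer ACG.
def toy_score(lmer):
    return sum(a == b for a, b in zip(lmer, "ACG"))

best = best_window_score("TTCGTAA", 3, toy_score)  # ACG occurs on the reverse strand
```

Ties count against the TFBS, so a TFBS that scores the same as a negative is ranked below it.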

After the $\nu$-fold CV, we end up with $n_+$ ranks, each of which corresponds to a TFBS. To allow comparison of methods, we use the area under the ROC curve (AUC) to gauge the performance of a method on the TF. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to [23] for an introduction to this metric. In this study, $\nu = 10$ for all the CV experiments.

For the NPV and ODV methods, the best weight and subspace combination is obtained at each iteration of the $\nu$-fold CV. Specifically, another $(\nu - 1)$-fold CV is performed on the $\nu - 1$ sets of TFBSs to search for the best combination.

3.3.2 Prokaryotic Transcription Factor Binding Sites

To understand the behavior of search methods on prokaryotic TF binding sites, we conducted 10-fold cross-validation experiments on the 26-TF RegulonDB data set. The proposed NPV and ODV methods were compared to the ULPB method [69]. The PSSM method, considered in [69], was also included for comparison since it served as a simple baseline method.

Figure 3.3.1a shows the plot of area under the ROC curve (AUC) across the 26 TFs for each method. We can see that the ODV method has the best AUC on 12 out of 26 TFs and the NPV method has the best AUC on 9 out of 26 TFs, whereas the ULPB and PSSM methods have the best AUC on 1 and 4 TFs, respectively. To gauge the relative performance between two methods, statistical tests [87] were performed on all the 6 pairs of methods. Figure 3.3.1b shows the p-values of the pairwise comparisons. We first notice that, consistent with the results in [69], ULPB outperformed PSSM, although the p-value of 0.0679 is slightly larger than the usual 0.05 significance cut-off. As seen in Figure 3.3.1b, the NPV and ODV methods are significantly better than the PSSM and ULPB methods. We can see that the ODV method benefited from optimization, although minimizing the objective function in (3.2.1) does not guarantee maximization of the AUC.

3.3.3 Eukaryotic Transcription Factor Binding Sites

Here we compare the proposed NPV and ODV methods to the ULPB and PSSM methods on eukaryotic TF binding sites. As in the previous section, we conducted 10-fold cross-validation experiments on the 28-TF JASPAR data set. Figure 3.3.2a shows the plot of AUC across the 28 TFs for each method. We can see that both the ODV and NPV methods have the best AUC on 13 out of 28 TFs, while the ULPB and PSSM methods have the best AUC on 6 and 4 TFs, respectively. All the methods have the best AUC of 1 on MA0149.1 and MA0115, while the ODV, NPV and PSSM methods have the best AUC of 0.999 on MA0137.2.

Similarly, statistical tests [87] were performed on all the 6 pairs of methods. Figure 3.3.2b shows that the NPV and ODV methods are significantly better than the PSSM and ULPB methods. ULPB is significantly better than PSSM, which is again consistent with the results reported in [69]. Overall, the relative performance of the four methods remains unchanged as we shift from prokaryotic transcription factors to eukaryotic ones. This suggests that a TFBS search method effective on prokaryotic transcription factors is likely to perform comparably well on eukaryotic transcription factors and vice versa.

3.3.4 Motif Subtype Identiﬁcation in Vector Spaces

It has been shown that the binding sites of a TF can be better represented by two motif subtypes than by a single motif [35, 28]. In searching for new binding sites, two position-specific scoring matrices are used to score an l-mer and the higher score of the two is assigned to this l-mer. Searching with two PSSMs was shown to be superior to searching with a single PSSM by cross-species conservation statistics in these studies.

We demonstrate that motif subtypes can be readily identified once we embed l-mers in a vector space. The purpose here, however, is not to compare motif subtype identification algorithms. We adopted a slightly different approach to motif subtype identification from those in previous work [35, 28], while the idea is similar. As usual, all the l-mers were first embedded in a vector space. The known binding sites of a TF were clustered into two subtypes by the k-means algorithm [17]. Immediately, we have a variant of the NPV method called the kNPV method, where $k = 2$ denotes the number of motif subtypes. The kNPV method first computes the mean vectors of these two subtypes, $\boldsymbol{\mu}_{+1}$ and $\boldsymbol{\mu}_{+2}$, and scores an l-mer $s$ by

$$\mathrm{Score}(s) = \max\left\{ \mathbf{s}^T (\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_-), \; \mathbf{s}^T (\boldsymbol{\mu}_{+2} - \boldsymbol{\mu}_-) \right\},$$

where $\boldsymbol{\mu}_-$ is the mean vector of the non-binding sites. Figure 3.3.3 illustrates the kNPV method.
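The kNPV scoring rule can be sketched as below, assuming the cluster label of every binding site vector has already been produced by some clustering routine (the study used k-means [17]; the clustering step itself is not shown). The function names and the two-dimensional toy vectors are assumptions for illustration.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def knpv_queries(positives, labels, negatives):
    """One negative-to-positive vector per motif subtype (cluster label)."""
    mu_neg = mean_vector(negatives)
    queries = []
    for k in sorted(set(labels)):
        mu_k = mean_vector([v for v, lab in zip(positives, labels) if lab == k])
        queries.append([p - q for p, q in zip(mu_k, mu_neg)])
    return queries

def knpv_score(s_vec, queries):
    """Score(s) = max over subtypes of s^T (mu_{+k} - mu_-)."""
    return max(dot(s_vec, t) for t in queries)

# Hypothetical embedded binding sites forming two subtypes, plus non-binding sites.
positives = [[2.0, 0.0], [4.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
negatives = [[0.0, 0.0], [2.0, 2.0]]
queries = knpv_queries(positives, labels, negatives)
```

In this toy example a single NPV would be $(0.5, 0.5)$ and would score the vector $(1, 0)$ as 0.5, whereas the subtype-specific query $(2, -1)$ scores it as 2.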

Similarly, the kODV method scores an l-mer $s$ by

$$\mathrm{Score}(s) = \max\left\{ \mathbf{s}^T \boldsymbol{\beta}_{+1} / \|\boldsymbol{\beta}_{+1}\|, \; \mathbf{s}^T \boldsymbol{\beta}_{+2} / \|\boldsymbol{\beta}_{+2}\| \right\},$$

where $\boldsymbol{\beta}_{+i}$ is obtained using TFBSs in cluster $i$, $i = 1, 2$. Unlike in the kNPV method, the lengths of the $\boldsymbol{\beta}_{+i}$'s may be very different and hence the $\boldsymbol{\beta}_{+i}$'s are scaled to unit vectors so as not to bias the scoring function. We note that the choice of $k = 2$ came from previous studies [35, 28]. Generally, $k$ can be greater than 2 or even automatically selected [37]. This, however, is beyond the scope of this study and may be investigated in the future.

We assessed the kNPV and kODV methods by 10-fold cross-validation on both the RegulonDB and JASPAR data sets. Figure 3.3.4 shows the results in terms of AUC. We observe in Figure 3.3.4a that overall introducing motif subtypes into the NPV and ODV methods improves the search performance (p-values: $6.41 \times 10^{-7}$ and $8.31 \times 10^{-5}$, respectively). Results in Figure 3.3.4b also support this observation (p-values: $1.61 \times 10^{-3}$ and $3.04 \times 10^{-3}$, respectively). The kNPV and kODV methods are comparable on both the RegulonDB and JASPAR data sets (p-values: 0.197 and 0.47, respectively). These results are consistent with those reported in [35, 28].

3.3.5 Independent Validation on ChIP-seq Data

To evaluate the proposed NPV and ODV methods on the whole genome scale, we built TF models using TFBSs in the JASPAR database to scan all the human (build hg19) 1000-base promoter sequences obtained from the UCSC Genome Browser database [26]. ChIP-seq peaks from the ENCODE project were also retrieved [68]. Specifically, the wgEncodeRegTfbsClusteredV2 table of build hg19 was obtained. We checked TFs in Table 3.2.18 against the annotations and found 14 JASPAR TFs, recognized by 17 antibodies present in the ENCODE annotations. The mapping is listed in the first 3 columns of Table 3.3.19.

Table 3.2.18: Statistics of TFs in the JASPAR Database

Mus musculus
ID          Name            Length   # TFBSs
MA0039.2    Klf4            10       4336
MA0047.2    Foxa2           12       809
MA0062.2    GABPA           11       87
MA0065.2    PPARG::RXRA     15       839
MA0104.2    Mycn            26       85
MA0141.1    Esrrb           12       3613
MA0142.1    Pou5f1          15       1332
MA0143.1    Sox2            15       666
MA0144.1    Stat3           19       830
MA0145.1    Tcfcp2l1        14       3931
MA0146.1    Zfx             20       477
MA0147.1    Myc             10       682
MA0154.1    EBF1            10       21

Homo sapiens
ID          Name            Length   # TFBSs
MA0037      GATA3           6        20
MA0052      MEF2A           10       31
MA0077      SOX9            9        45
MA0080.2    SPI1            7        35
MA0083      SRF             12       26
MA0112.2    ESR1            20       472
MA0115      NR1H2::RXRA     17       22
MA0137.2    STAT1           15       2082
MA0138      REST            19       22
MA0138.2    REST            11       871
MA0139.1    CTCF            11       944
MA0148.1    FOXA1           11       896
MA0149.1    EWSR1-FLI1      17       101
MA0159.1    RXR::RAR_DR5    17       23
MA0258.1    ESR2            18       356

[Figure 3.3.1: (a) AUC (vertical axis) of the NPV, ODV, ULPB and PSSM methods for each of the 26 RegulonDB TFs (horizontal axis). (b) Matrix of p-values from pairwise comparisons:

        PSSM        ULPB        NPV         ODV
PSSM    0.5         0.0679      7.24e−05    0.000802
ULPB    0.0679      0.5         0.00817     0.000802
NPV     7.24e−05    0.00817     0.5         0.291
ODV     0.000802    0.000802    0.291       0.5
]

Figure 3.3.1: Comparison of the PSSM, ULPB, NPV and ODV Methods on the RegulonDB Data Set

(a) Plot of AUC values across the 26 prokaryotic TFs for each method. (b) Matrix of p-values from pairwise comparisons. A red solid circle in row i and column j indicates that method i outperformed method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply the significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite.

[Figure 3.3.2: (a) AUC (vertical axis) of the NPV, ODV, ULPB and PSSM methods for each of the 28 JASPAR TFs (horizontal axis). (b) Matrix of p-values from pairwise comparisons:

        PSSM        ULPB        NPV         ODV
PSSM    0.5         0.0468      2.42e−05    7.71e−05
ULPB    0.0468      0.5         0.0154      0.00189
NPV     2.42e−05    0.0154      0.5         0.39
ODV     7.71e−05    0.00189     0.39        0.5
]

Figure 3.3.2: Comparison of the PSSM, ULPB, NPV and ODV Methods on the JASPAR Data Set

(a) Plot of AUC values across the 28 eukaryotic TFs for each method. (b) Matrix of p-values from pairwise comparisons. A red solid circle in row i and column j indicates that method i outperformed method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply the significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite.

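[Figure 3.3.3: scatter plot of TFBS and non-TFBS vectors together with the subtype means $\boldsymbol{\mu}_{+1}$, $\boldsymbol{\mu}_{+2}$ and the non-TFBS mean $\boldsymbol{\mu}_-$; horizontal axis: Axis 1, vertical axis: Axis 2.]

Figure 3.3.3: Illustration of the kNPV Method

The solid arrows represent the negative-to-positive vectors $\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_-$ and $\boldsymbol{\mu}_{+2} - \boldsymbol{\mu}_-$, pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_{+1}$ and $\boldsymbol{\mu}_{+2}$, respectively. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The centers of the binding site vectors are marked by the solid triangles, while the center of the non-binding site vectors is marked by the solid circle.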

[Figure 3.3.4: panels (a) and (b) plot AUC (vertical axis) of the ODV, kODV, NPV and kNPV methods for each TF (horizontal axis).]

Figure 3.3.4: The kNPV (kODV) Method Versus the NPV (ODV) Method

The number of motif subtypes k is set to 2. (a) Plot of AUC values across the 26 prokaryotic TFs for each method. (b) Plot of AUC values across the 28 eukaryotic TFs for each method.

Table 3.3.19: Results of Independent Validation on ChIP-seq Data

ENCODE               JASPAR     Name     PSSM      ULPB      NPV       S¹   IC   ODV       S    IC
GATA3 (SC-268)       MA0037     GATA3    0.48922   0.46841   0.50963   1    Y    0.51441   1    Y
MEF2A                MA0052     MEF2A    0.42566   0.45955   0.35283   3    Y    0.34807   3    N
PU.1                 MA0080.2   SPI1     0.50631   0.49267   0.57575   3    Y    0.58014   3    N
SRF                  MA0083     SRF      0.34299   0.38457   0.43920   2    N    0.43183   3    N
NRSF                 MA0138     REST     0.50615   0.46371   0.46603   1    N    0.47956   2    N
NRSF                 MA0138.2   REST     0.48031   0.48299   0.49070   3    Y    0.49522   3    N
ERalpha_a            MA0112.2   ESR1     0.53980   0.49058   0.52414   3    N    0.52146   1    N
STAT1                MA0137.2   STAT1    0.55348   0.58555   0.61733   1    N    0.62338   1    Y
CTCF                 MA0139.1   CTCF     0.60370   0.60377   0.63785   2    Y    0.64769   2    Y
CTCF (C-20)          MA0139.1   CTCF     0.44108   0.44696   0.53181   2    Y    0.54306   2    Y
CTCF (SC-5916)       MA0139.1   CTCF     0.46729   0.47047   0.54097   2    Y    0.55028   2    Y
FOXA1 (C-20)         MA0148.1   FOXA1    0.48083   0.48698   0.48994   3    Y    0.49853   3    N
FOXA1 (SC-101058)    MA0148.1   FOXA1    0.48897   0.48326   0.49945   3    Y    0.50986   3    N
EBF                  MA0154.1   EBF1     0.50011   0.51202   0.56084   3    Y    0.59172   3    N
EBF1 (C-8)           MA0154.1   EBF1     0.42214   0.43705   0.52067   3    Y    0.53207   3    N
FOXA2 (SC-6554)      MA0047.2   Foxa2    0.48328   0.39496   0.45500   3    Y    0.47906   3    N
STAT3                MA0144.1   Stat3    0.39145   0.33052   0.38094   3    Y    0.43807   3    Y
POU5F1 (SC-9081)     MA0142.1   Pou5f1   0.42151   0.42793   0.40855   3    N    0.45449   3    N

¹ Subspace 1: $\mathbb{R}^{4l}$; 2: $\mathbb{R}^{16l-12}$; 3: $\mathbb{R}^{36l-48}$. Rows that share a JASPAR ID share one TF model, so their subspace (S) and weight (IC) entries are identical.

For the NPV and ODV methods, the best weight and subspace combination was found by 5-fold cross-validation on the JASPAR TFBSs, while flanking genomic sequences of the TFBSs were the sources of negative binding sites.

To assess the 4 compared methods, we considered the part of a ROC curve where FPR is at most 0.01 and calculated the AUC scaled to between 0 and 1. This is nearly equivalent to allowing at most 10 false positive hits per promoter on average. As a peak spans about 200 bases, it is considered recalled when it fully contains a predicted binding site. Similarly, a predicted binding site must be fully covered by a peak to be a true positive hit.
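One plausible reading of the scaled partial AUC is sketched below: the area under the ROC curve for FPR between 0 and the cut-off, divided by the cut-off so that the result lies between 0 and 1. The exact scaling used in the study is not stated, so this normalization, the trapezoid rule and the function name `partial_auc` are assumptions for illustration.

```python
def partial_auc(pos_scores, neg_scores, max_fpr=0.01):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, from the strictest threshold down
    for th in thresholds:
        tpr = sum(s >= th for s in pos_scores) / n_pos
        fpr = sum(s >= th for s in neg_scores) / n_neg
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the last segment at max_fpr by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    return area / max_fpr

# Toy scores: perfectly separated, fully reversed, and partially mixed.
perfect = partial_auc([3, 4], [1, 2], max_fpr=0.5)
worst = partial_auc([1, 2], [3, 4], max_fpr=0.5)
full = partial_auc([2, 4], [1, 3], max_fpr=1.0)
```

With `max_fpr=1.0` the function reduces to the ordinary AUC, which gives a simple sanity check on the toy scores.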

In Table 3.3.19, we observe that ODV, NPV, ULPB and PSSM produced the best AUC on 13, 1, 1 and 3 out of 18 tests, respectively. Statistical tests showed that ODV significantly outperformed the other 3 methods (p-values ≤ 0.0028), NPV significantly outperformed ULPB and PSSM (p-values ≤ 0.0449), and ULPB and PSSM are comparable (p-value: 0.433). We notice that both NPV and ODV performed worse than the other two methods on MEF2A. As NPV and ODV both sample negative examples from flanking sequences of TFBSs, we suspect that this is one example where the flanking sequences do not represent the entire promoters well. ODV performed consistently across tests corresponding to the same JASPAR ID, such as the three for CTCF. Examining the best weight and subspace selected for NPV and for ODV, we can see that the subspace agrees between the two methods on 11 out of 14 TF models, while the weight agrees on only 7 of them. The latter may be because ODV optimizes the $\boldsymbol{\beta}$ vector and hence is less sensitive to the weight used to embed an l-mer.

3.4 Conclusions

In this chapter, we proposed to search for transcription factor binding sites in vector spaces. The novel NPV and ODV methods were introduced to construct a query vector to search for binding sites of a TF. We compared our methods to a state-of-the-art method, the ULPB method, and the widely-used PSSM method. Cross-validation experiments revealed that the NPV and ODV methods significantly outperformed the ULPB and PSSM methods on prokaryotic as well as eukaryotic TF binding sites. Independent validation on human ChIP-seq data further verified that the NPV and ODV methods are significantly better than the other compared methods.

One of the advantages of our framework is that it allows one to easily search for binding sites in various subspaces. Hence, one can search in the best subspace for each individual TF since one can hardly find an optimal subspace for all the TFs. Another advantage is that under the proposed framework one can readily identify motif subtypes for a TF. Hence, to exploit this advantage, we introduced the kNPV and kODV methods, immediate variants of the NPV and ODV methods. We demonstrated that, consistent with results in previous studies, kNPV (kODV) significantly improved on NPV (ODV) on the two data sets.

Our future work aims to extend our proposed methods to handle known binding sites of variable lengths. We will seek to approach this problem without resorting to multiple sequence alignment, which is notoriously time-consuming. In the meantime, we will also seek to identify additional promising subspaces in which to search for TF binding sites.


Bibliography

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.

[2] T. L. Bailey, M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. Meme suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202–W208, 2009.

[3] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28–36, 1994.

[4] S. Balla, V. Thapar, S. Verma, T. Luong, T. Faghri, C.-H. Huang, S. Rajasekaran, J. J. del Campo, J. H. Shinn, W. A. Mohler, M. W. Maciejewski, M. R. Gryk, B. Piccirillo, S. R. Schiller, and M. R. Schiller. Minimotif miner: a tool for investigating protein function. Nature methods, 3(3):175–177, March 2006.

[5] Y. Barash, G. Bejerano, and N. Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. In WABI ’01: Proceedings of the First International Workshop on Algorithms in Bioinformatics, pages 278–293, London, UK, 2001. Springer-Verlag.

[6] M. F. Berger and M. L. Bulyk. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding speciﬁcities of transcription factors. Nature protocols, 4(3):393–411, 2009.

[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientiﬁc, Belmont, MA, 1999.

[8] C. Bi, J. Leeder, and C. Vyhlidal. A comparative study on computational two-block motif detection: algorithms and applications. Mol Pharm, 5(1):3–16, 2008.

[9] S. G. Blumenthal, G. Aichele, T. Wirth, A. P. Czernilofsky, A. Nordheim, and J. Dittmer. Regulation of the human interleukin-5 promoter by ets transcription factors: Ets1 and ets2, but not elf-1, cooperate with gata3 and htlv-i tax1. Journal of Biological Chemistry, 274(18):12910–12916, 1999.

[10] J. C. Bryne, E. Valen, M.-H. E. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh, B. Lenhard, and A. Sandelin. Jaspar, the open access database of transcription factor-binding proﬁles: new content and tools in the 2008 update. Nucleic Acids Research, 36(suppl 1):D102–D106, 2008.

[11] J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB ’01: Proceedings of the ﬁfth annual international conference on Computational biology, pages 69–76, New York, NY, USA, 2001. ACM.

[12] K. Cartharius, K. Frech, K. Grote, B. Klocke, M. Haltmeier, A. Klingenhoff, M. Frisch, M. Bayerlein, and T. Werner. Matinspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics, 21(13):2933–2942, 2005.

[13] D. S. Chekmenev, C. Haid, and A. E. Kel. P-match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Research, 33(suppl 2):W432–W437, 2005.

[14] The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, Sept. 2012.

[15] G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner. Weblogo: A sequence logo generator. Genome Research, 14(6):1188–1190, June 2004.

[16] E. V. Davydov, D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Batzoglou. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS computational biology, 6(12):e1001025+, Dec. 2010.

[17] M. J. de Hoon, S. Imoto, J. Nolan, and S. Miyano. Open source clustering software. Bioinformatics, 20(9):1453–1454, June 2004.

[18] H. T. Do and D. Wang. Overlap-based similarity metrics for motif search in dna sequences. In ICONIP ’09: Proceedings of the 16th International Conference on Neural Information Processing, pages 465–474, Berlin, Heidelberg, 2009. Springer-Verlag.

[19] T. R. Dreszer, D. Karolchik, A. S. Zweig, A. S. Hinrichs, B. J. Raney, R. M. Kuhn, L. R. Meyer, M. Wong, C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, A. Pohl, V. S. Malladi, C. H. Li, K. Learned, V. Kirkup, F. Hsu, R. A. Harte, L. Guruvadoo, M. Goldman, B. M. Giardine, P. A. Fujita, M. Diekhans, M. S. Cline, H. Clawson, G. P. Barber, D. Haussler, and W. James Kent. The ucsc genome browser database: extensions and updates 2011. Nucleic Acids Research, 40(D1):D918–D923, 2012.

[20] S. R. Eddy. A probabilistic model of local sequence alignment that simpliﬁes statistical signiﬁcance estimation. PLoS computational biology, 4(5):e1000069+, May 2008.

[21] J. Ernst, H. L. Plasterer, I. Simon, and Z. Bar-Joseph. Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Research, 20(4):526–536, 2010.

[22] P. J. Farnham. Insights from genomic proﬁling of transcription factors. Nat Rev Genet, 10(9):605–616, 2009.

[23] T. Fawcett. An introduction to roc analysis. Pattern Recogn. Lett., 27:861–874, June 2006.

[24] E. Fazius, V. Shelest, and E. Shelest. Sitar: a novel tool for transcription factor binding site prediction. Bioinformatics, 27(20):2806–2811, 2011.

[25] P. M. Fordyce, D. Pincus, P. Kimmig, C. S. Nelson, H. El-Samad, P. Walter, and J. L. DeRisi. Basic leucine zipper transcription factor hac1 binds dna in two distinct modes as revealed by microfluidic analyses. Proceedings of the National Academy of Sciences, 109(45):E3084–E3093, 2012.

[26] P. A. Fujita, B. Rhead, A. S. Zweig, A. S. Hinrichs, D. Karolchik, M. S. Cline, M. Goldman, G. P. Barber, H. Clawson, A. Coelho, M. Diekhans, T. R. Dreszer, B. M. Giardine, R. A. Harte, J. Hillman-Jackson, F. Hsu, V. Kirkup, R. M. Kuhn, K. Learned, C. H. Li, L. R. Meyer, A. Pohl, B. J. Raney, K. R. Rosenbloom, K. E. Smith, D. Haussler, and W. J. Kent. The ucsc genome browser database: update 2011. Nucleic Acids Research, 39(suppl 1):D876–D882, 2011.

[27] S. Gama-Castro, V. Jiménez-Jacinto, M. Peralta-Gil, A. Santos-Zavaleta, M. I. Peñaloza Spinola, B. Contreras-Moreira, J. Segura-Salazar, L. Muñiz Rascado, I. Martínez-Flores, H. Salgado, C. Bonavides-Martínez, C. Abreu-Goodger, C. Rodríguez-Penagos, J. Miranda-Ríos, E. Morett, E. Merino, A. M. Huerta, L. Treviño Quintanilla, and J. Collado-Vides. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research, 36(suppl 1):D120–D124, 2008.

[28] B. Georgi and A. Schliep. Context-specific independence mixture modeling for positional weight matrices. Bioinformatics, 22(14):e166–e173, 2006.

[29] S. Georgiev, A. Boyle, K. Jayasurya, X. Ding, S. Mukherjee, and U. Ohler. Evidence-ranked motif identification. Genome Biology, 11(2):R19, 2010.

[30] D. G. Gilbert. euGenes: a eukaryote genome information system. Nucleic Acids Research, 30(1):145–148, 2002.

[31] G. E. Gonye, P. Chakravarthula, J. S. Schwaber, and R. Vadigepalli. From promoter analysis to transcriptional regulatory network prediction using PAINT. 408, August 2007.

[32] O. L. Griffith, S. B. Montgomery, B. Bernier, B. Chu, K. Kasaian, S. Aerts, S. Mahony, M. C. Sleumer, M. Bilenky, M. Haeussler, M. Griffith, S. M. Gallo, B. Giardine, B. Hooghe, P. Van Loo, E. Blanco, A. Ticoll, S. Lithwick, E. Portales-Casamar, I. J. Donaldson, G. Robertson, C. Wadelius, P. De Bleser, D. Vlieghe, M. S. Halfon, W. Wasserman, R. Hardison, C. M. Bergman, S. J. Jones, and The Open Regulatory Annotation Consortium. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Research, 36(suppl 1):D107–D113, 2008.

[33] S. Gupta, J. Stamatoyannopoulos, T. Bailey, and W. Noble. Quantifying similarity between motifs. Genome Biology, 8(2):R24, 2007.

[34] S. Hannenhalli. Eukaryotic transcription factor binding sites–modeling and integrative search methods. Bioinformatics, 24(11):1325–1331, 2008.

[35] S. Hannenhalli and L.-S. Wang. Enhanced position weight matrices using mixture models. Bioinformatics, 21(suppl 1):i204–212, 2005.

[36] G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7):563–577, 1999.

[37] A. K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8):651–666, June 2010.

[38] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316(5830):1497–1502, 2007.

[39] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, October 2002.

[40] A. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender. Match™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31(13):3576–3579, 2003.

[41] S. M. Kiełbasa, H. Klein, H. G. Roider, M. Vingron, and N. Blüthgen. TransFind–predicting transcriptional regulators for gene sets. Nucleic Acids Research, 38(suppl 2):W275–W280, 2010.

[42] T. Kozuka, M. Sugita, S. Shetzline, A. M. Gewirtz, and Y. Nakata. c-Myb and GATA-3 cooperatively regulate IL-13 expression via conserved GATA-3 response element and recruit mixed lineage leukemia (MLL) for histone modification of the IL-13 locus. The Journal of Immunology, 187(11):5974–5982, 2011.

[43] D. L. Kroshko. OpenOpt 0.36. http://openopt.org/, Sept. 2011.

[44] A. Kumar and L. Cowen. Recognition of beta-structural motifs using hidden Markov models trained with simulated evolution. Bioinformatics, 26(12):i287–i293, 2010.

[45] M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson, T. Gibson, and D. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 2007.

[46] C. Lee, A. Abdool, and C.-H. Huang. PCA-based population structure inference with generic clustering algorithms. BMC Bioinformatics, 10(S-1):S73, 2009.

[47] C. Lee and C.-H. Huang. LASAGNA-Search 2.0: integrated transcription factor binding site search and visualization in a browser. submitted.

[48] C. Lee and C.-H. Huang. Geometric visualization of transcription factor binding sites in context. In Proceedings of the Second ACM International Conference on Bioinformatics and Computational Biology, BCB ’11, pages 457–461, New York, NY, USA, 2011. ACM.

[49] C. Lee and C.-H. Huang. Searching for transcription factor binding sites in vector spaces. BMC Bioinformatics, 13(1):215, 2012.

[50] C. Lee and C.-H. Huang. LASAGNA: A novel algorithm for transcription factor binding site alignment. BMC Bioinformatics, 14:108, 2013.

[51] C. Lee and C.-H. Huang. LASAGNA-Search: an integrated webtool for transcription factor binding site search and visualization. BioTechniques, 54(3):141–153, 2013.

[52] C. Lee, I. Mandoiu, and C. Nelson. Inferring ethnicity from mitochondrial DNA sequence. BMC Proceedings, 5(Suppl 2):S11, 2011.

[53] D. L. Lee, H. Chuang, and K. Seamons. Document ranking and the vector-space model. IEEE Softw., 14:67–75, March 1997.

[54] L. Li. GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery. Journal of Computational Biology, 16(2):317–329, Feb. 2009.

[55] N. Li and M. Tompa. Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, 1(1):8, 2006.

[56] X. Liu, D. L. Brutlag, and J. S. Liu. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In Pac. Symp. Biocomput, pages 127–138, 2001.

[57] C. T. Lopes, M. Franz, F. Kazi, S. L. Donaldson, Q. Morris, and G. D. Bader. Cytoscape web: an interactive web-based network browser. Bioinformatics, 26(18):2347–2348, 2010.

[58] V. D. Marinescu, I. S. Kohane, and A. Riva. The mapper database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Research, 33(suppl 1):D91–D97, 2005.

[59] V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Wingender. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research, 34(suppl 1):D108–D110, 2006.

[60] D. E. Newburger and M. L. Bulyk. UniPROBE: an online database of protein binding microarray data on protein–DNA interactions. Nucleic Acids Research, 37(suppl 1):D77–D82, 2009.

[61] C. Notredame. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol, 3(8):e123, 08 2007.

[62] Y. M. Oh, J. K. Kim, S. Choi, and J.-Y. Yoo. Identification of co-occurring transcription factor binding sites from DNA sequence using clustered position weight matrices. Nucleic Acids Research, 40(5):e38, 2012.

[63] R. Osada, E. Zaslavsky, and M. Singh. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics, 20(18):3516–3525, 2004.

[64] E. Portales-Casamar, D. Arenillas, J. Lim, M. I. Swanson, S. Jiang, A. McCallum, S. Kirov, and W. W. Wasserman. The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences. Nucleic Acids Research, 37(suppl 1):D54–D60, 2009.

[65] E. Portales-Casamar, S. Thongjuea, A. T. Kwon, D. Arenillas, X. Zhao, E. Valen, D. Yusuf, B. Lenhard, W. W. Wasserman, and A. Sandelin. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research, 38(suppl 1):D105–D110, 2010.

[66] S. Rajasekaran, S. Balla, and C.-H. Huang. Exact algorithms for planted motif problems. Journal of Computational Biology, 12(8):1117–1128, 2005.

[67] A. Riva. The MAPPER2 Database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Research, 40(D1):D155–D161, Jan. 2012.

[68] K. R. Rosenbloom, T. R. Dreszer, M. Pheasant, G. P. Barber, L. R. Meyer, A. Pohl, B. J. Raney, T. Wang, A. S. Hinrichs, A. S. Zweig, P. A. Fujita, K. Learned, B. Rhead, K. E. Smith, R. M. Kuhn, D. Karolchik, D. Haussler, and W. J. Kent. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Research, 38(suppl 1):D620–D625, 2010.

[69] R. A. Salama and D. J. Stekel. Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucl. Acids Res., 38(12):e135, 2010.

[70] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.

[71] A. Sandelin, W. W. Wasserman, and B. Lenhard. ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Research, 32(suppl 2):W249–W252, 2004.

[72] G. Sandve, O. Abul, V. Walseng, and F. Drablos. Improved benchmarks for computational motif discovery. BMC Bioinformatics, 8(1):193, 2007.

[73] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 188(3):415–431, 1986.

[74] J. Schug. Using TESS to predict transcription factor binding sites in DNA sequence. In A. D. Baxevanis, editor, Current Protocols in Bioinformatics. J. Wiley and Sons, 2003.

[75] D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6(7):639–645, July 1996.

[76] S. Sinha. Discriminative motifs. In RECOMB ’02: Proceedings of the sixth annual international conference on Computational biology, pages 291–298, New York, NY, USA, 2002. ACM.

[77] R. Staden. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research, 12(1Part2):505–519, 1984.

[78] S. Struckmann, D. Esch, H. Scholer, and G. Fuellen. Visualization and exploration of conserved regulatory modules using rexspecies 2. BMC Evolutionary Biology, 11(1):267, 2011.

[79] K. T. Takusagawa and D. K. Gifford. Negative information for motif discovery. In Pacific Symposium on Biocomputing, pages 360–371, 2004.

[80] M. Thomas-Chollier, C. Herrmann, M. Defrance, O. Sand, D. Thieffry, and J. van Helden. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, 40(4):e31, 2012.

[81] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994.

[82] B. Tokovenko, R. Golda, O. Protas, M. Obolenskaya, and A. El’skaya. COTRASIF: conservation-aided transcription-factor-binding site finder. Nucleic Acids Research, 37(7):e49, 2009.

[83] J.-V. Turatsinze, M. Thomas-Chollier, M. Defrance, and J. van Helden. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nature Protocols, 3(10):1578–1588, Sept. 2008.

[84] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 11–18, New York, NY, USA, 2006. ACM.

[85] J. Vilo, A. Brazma, I. Jonassen, A. Robinson, and E. Ukkonen. Mining for putative regulatory elements in the yeast genome using gene expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 384–394. AAAI Press, 2000.

[86] J. Wang, M. F. Shannon, and I. G. Young. A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes. International Immunology, 18(2):313–323, February 2006.

[87] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[88] C. Wong, Y. Li, C. Lee, and C.-H. Huang. Ensemble learning algorithms for classification of mtDNA into haplogroups. Briefings in Bioinformatics, 12(1):1–9, 2011.

[89] C. Yanover, M. Singh, and E. Zaslavsky. M are better than one: an ensemble-based motif finder and its application to regulatory element prediction. Bioinformatics, 25(7):868–874, 2009.

[90] F. Zambelli, G. Pesole, and G. Pavesi. Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Research, 37(suppl 2):W247–W252, 2009.

[91] E. Zaslavsky and M. Singh. A combinatorial optimization approach for diverse motif finding applications. Algorithms for Molecular Biology, 1(1):13, 2006.

[92] Y. Zhao and G. D. Stormo. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nature Biotechnology, 29(6):480–483, June 2011.

[93] J. Zhu and M. Q. Zhang. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15(7):607–611, 1999.
