Nucleic Acids Research Advance Access published December 13, 2013 Nucleic Acids Research, 2013, 1–12 doi:10.1093/nar/gkt1249

Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments Pouya Kheradpour1,2 and Manolis Kellis1,2,*

1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, USA and 2Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02139, USA

Received August 7, 2013; Revised November 6, 2013; Accepted November 7, 2013 Downloaded from ABSTRACT present in a given condition and cell type or tissue. As these technologies have matured, their use has become Recent advances in technology have led to a increasingly widespread. The resolution of these experi- dramatic increase in the number of available tran- mental techniques can be as low as 300 bp for ChIP-chip scription factor ChIP-seq and ChIP-chip data sets. (5) and 50 bp for ChIP-seq (6), depending on the experi- Understanding the motif content of these data sets mental design (e.g. fragment size, paired-end sequencing) http://nar.oxfordjournals.org/ is an important step in understanding the underlying and algorithmic processing of the raw data. mechanisms of regulation. Here we provide a sys- The use of these technologies on a variety of factors tematic motif analysis for 427 human ChIP-seq data across many cell types has increasingly highlighted the sets using motifs curated from the literature and complex nature of TF activity, often violating the simple also discovered de novo using five established model of a factor binding to its recognition pattern (motif) motif discovery tools. We use a systematic in isolation: binding has been shown to be dynamic across pipeline for calculating motif enrichment in each cell types, requiring the coordinated binding of cofactors data set, providing a principled way for choosing or specific configurations of the underlying chromatin. at MIT Libraries on December 14, 2013 between motif variants found in the literature and Moreover, TF binding frequently occurs in the absence of any discernible motif instance (7,8) or to ‘hot-spots’ for flagging potentially problematic data sets. Our where several factors are simultaneously found (9). analysis confirms the known specificity of 41 of Understanding this complex binding necessitates identify- the 56 analyzed factor groups and reveals motifs ing the underlying sequence features responsible. To of potential cofactors. We also use cell type- address this need, we have performed a systematic, specific binding to find factors active in specific motif-centric analysis of hundreds of TF binding experi- conditions. The resource we provide is accessible ments made available as part of the human ENCODE both for browsing a small number of factors and project (8,10). As part of this, we provide a collection of for performing large-scale systematic analyses. We motifs for each assayed factor, both taken from the litera- provide motif matrices, instances and enrichments ture and through de novo discovery, and also an annota- in each of the ENCODE data sets. The motifs dis- tion of motif instances genome-wide, which may be covered here have been used in parallel studies to used to pinpoint the specific regulatory bases in regions validate the specificity of antibodies, understand bound by TFs. cooperativity between data sets and measure the We found that no single algorithm or database compre- hensively assays the motifs relevant to the binding diver- variation of motif binding across individuals and sity surveyed by ENCODE. Therefore, our approach was species. to collect motifs from several literature sources (11–16) and supplement them with motifs discovered de novo on the data sets themselves using five established tools INTRODUCTION (17–21). Although this general approach of using Chromatin immunoprecipitation (ChIP) (1) followed by multiple motif discovery tools is popular [e.g. (22–24)], hybridization to an array (ChIP-chip) (2,3) or sequencing its application to this number of data sets is unprecedented (ChIP-seq) (4) enables the genome-wide identification of and permits the identification of TFs that are likely to be the binding locations of transcription factors (TFs) interacting or participating in common pathways.

*To whom correspondence should be addressed. Tel: +1 617 253 2419; Fax: +1 617 452 5034; Email: [email protected]

ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 2 Nucleic Acids Research, 2013

This work is accompanied by a web interface for pipeline, or are due to real, but relatively weak, specificity browsing the discovered and literature motifs along with for the factor. We also exclude an additional 36 motifs their enrichments (Figure 1; http://compbio.mit.edu/ that have a weak similarity to the known motif for the -motifs). In addition to the browsing interface, we factor but for which a better matching and enriched motif provide several data files including all motif matrices and is also found (Supplementary Table S2). These are most their matches to the genome, as well as software to frequently seen for longer motifs that can be broken up compute enrichments and perform unified motif discovery into recognizable, but globally dissimilar, patterns that are with the five tools we use. Together, these permit both not captured by our automatic exclusion criteria (see analyses of individual factors (e.g. to identify cooperating Supplementary Methods). Together, these represent 28% TFs) in addition to systematic analysis (e.g. to examine of the 293 discovered motifs. differences between TFs). Moreover, the breadth of data sets available enables systematic comparisons and analyses that are not possible when only one or a few RESULTS factors are studied in isolation. Using motif similarity metrics, we are able to link the dis- Later in the text, we describe the details of how the covered motifs directly to the TFs that recognize them resource was generated and conduct an initial analysis to through their known motifs. Here we use these inferred provide examples of its usage and to highlight potentially relationships between TFs to make specific biological Downloaded from interesting results. insights, illustrating the types of analyses that our resource enables. In the interest of clarity, most descrip- tions of TFs will be omitted, but may be found along with MATERIALS AND METHODS further references at RefSeq (28) and (29).

Our goals were to produce a resource that (i) contains a http://nar.oxfordjournals.org/ comprehensive collection of relevant motifs for each Recovery of known specificity for TFs factor; (ii) avoids repetitive, weakly enriched motifs that Most of the known literature motifs we collect are derived do not contribute to the in vivo specificity of the factor or from biochemical in vitro assays. Thus, they provide a its partners; and (iii) excludes variants of the same motif, largely independent, although somewhat imperfect way particularly among the discovered motifs. With this in to evaluate the performance of our discovered motifs. mind, we conducted motif discovery separately on each Recovery of known motifs varies significantly by data set using five motif discovery tools and manually method, but taking the most enriched motif (our

placed all its data sets into ‘factor groups’ on the basis pipeline) is competitive with the best single method at MIT Libraries on December 14, 2013 of known motifs and homology (Figure 2). Known (Figure 3b). Overall, our pipeline found a motif motifs from the literature and the top 10 most enriched matching a previously characterized literature motif for discovered motifs (excluding duplicates) were collected for 41 of the 56 factor groups with a known motif. each factor group (see Supplementary Methods) and One of the most striking observations of this analysis is named as TF_known# for known motifs and TF_disc# how frequently other distinct motifs were also found. For for discovered motifs, where TF denotes the factor 29 of these 41 factor groups other motifs are found, even group (e.g. FOXA, CTCF, etc.). Known motifs were after manually excluding redundant or repetitive motifs, ordered arbitrarily, whereas the discovered motifs were and for 9 factor groups one or more of these discovered ordered in descending order of the enrichment value that motifs is ranked higher than the motif matching a known was used for their selection. motif (see Supplementary Table S3). In the next section, The 427 ENCODE experiments analyzed correspond to we will analyze the additional motifs we found for these 123 TFs, which we place into 84 factor groups (Figure 3a). factors, which in many cases identify factors known to We failed to discover an enriched motif for only 12 of the interact, either cooperatively or competitively. 84 factor groups, of which 9 lack DNA binding domains For the remaining 15 of 56 factor groups with a known (BRF, CTBP2, HDAC8, KAT2A, NELFE, SUPT20H, motif (e.g. HSF, NANOG, PBX3, SREBP and TAL1) the SUZ12, WRNIP1 and XRCC4) as identified by UniProt known motif is not found at all, including NR4A where (27), and 6 have all their data sets flagged as unreliable no enriched motif is discovered. Frequently this is because based on various quality metrics [BRF, KAT2A, NELFE, the known motif itself is not enriched and may not accur- NR4A, SUPT20H and ZZZ3; see (A. Kundaje, L.Y. Jung, ately capture the specificity of the factor in vivo. For P.V. Kharchenko, B. Wold, A. Sidow, S. Batzoglou and example, the ‘known’ EP300 motif from Transfac was P.J. Park, in preparation)]. Of these factor groups, only likely built on a specific bound region of EP300 and NR4A has a previously identified known motif. would not accurately capture its binding in all cell types We exclude from the discussion below motifs that we where it interacts with a variety of factors and has no consider unlikely to be relevant to our analysis, while DNA binding domain of its own (we avoided removing maintaining them as part of the overall resource where such motifs to prevent bias in the database). Likewise, we they may be useful. These include 46 discovered motifs do not discover a motif that matches the known ZBTB33 that are either low-complexity (e.g. dinucleotide repeats) specificity, and moreover the known motif itself is not or consistently have weak enrichment (<2) and do not enriched at all in the bound regions. match known motifs (Supplementary Table S1). These Although some known motifs were of apparently low are likely a consequence of slight biases in the discovery quality, we largely found our database of known motifs to (a) (b)(c) FOXA_disc1

FOXA_disc2

FOXA_disc3 FOXA_disc3 FOXA_known6 FOXJ1_1 FOXI1_2 FOXI1_1 FOXC1_4 STAT_known8 FEV_1 SPI1_known2 STAT_disc5 TCF7L1_2 HNF4_known1 FOXA_disc1 FOXA_known5 FOXA_known7 FOXA_known4 FOXA_known3 FOXA_known2 FOXA_known1 FOXA_disc2 FOXA_disc5 FOXA_disc4 RARA_1 NR2F2_2 NR4A_known1 HNF4_known26 NR3C1_disc5 TCF12_known1 HNF4_disc4 FOXD3_3 FOXD2_1 FOXC2_1 ARID5A_1 FOXA_disc4 FOXJ2_1 FOXA_disc3 1.0 0.5 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.3 0.5 0.4 0.4 0.3 0.2 0.4 0.4 0.4 0.5 1.0 0.8

FOXA_disc5 FOXA_known6 0.5 1.0 1.0 1.0 0.9 0.9 0.8 0.8 0.8 0.7 0.4 0.3 0.7 0.6 0.7 0.8 0.8 0.7 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.4 0.3 0.4 0.4 0.4 0.5 0.4 0.2

FOXA_disc1 0.4 1.0 1.0 1.0 0.9 0.9 0.8 0.9 0.8 0.7 0.4 0.3 0.8 0.6 0.7 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.5 0.5 0.4 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.3 0.2 FOXA_known1 FOXA_known5 0.4 1.0 1.0 1.0 0.9 0.9 0.9 0.9 0.8 0.6 0.4 0.3 0.8 0.7 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.5 0.4 0.4 0.3 0.4 0.4 0.3 0.4 0.4 0.4 0.3 0.3

FOXA_known2 FOXA_known7 0.4 0.9 0.9 0.9 1.0 0.8 0.8 0.8 0.8 0.7 0.4 0.3 0.6 0.8 0.7 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.5 0.4 0.4 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.3 0.1

FOXA_known4 0.4 0.9 0.9 0.9 0.8 1.0 0.9 0.9 0.9 0.6 0.4 0.3 0.8 0.8 0.8 0.8 0.9 0.7 0.7 0.7 0.7 0.6 0.5 0.4 0.4 0.3 0.5 0.4 0.3 0.3 0.3 0.3 0.3 0.2 FOXA_known3 FOXA_known3 0.3 0.8 0.8 0.9 0.8 0.9 1.0 1.0 0.9 0.5 0.4 0.3 0.8 0.9 0.9 0.9 0.9 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.5 0.4 0.3 0.2 0.3 0.2 0.2 0.2

FOXA_known4 FOXA_known2 0.3 0.8 0.9 0.9 0.8 0.9 1.0 1.0 0.9 0.6 0.4 0.4 0.8 0.8 0.9 0.9 0.9 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.5 0.4 0.3 0.3 0.3 0.2 0.2 0.2

FOXA_known1 0.3 0.8 0.8 0.8 0.8 0.9 0.9 0.9 1.0 0.7 0.4 0.3 0.7 0.8 0.8 0.9 0.9 0.7 0.7 0.7 0.7 0.6 0.5 0.4 0.4 0.3 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.1 FOXA_known5 FOXA_disc2 0.3 0.7 0.7 0.6 0.7 0.6 0.5 0.6 0.7 1.0 0.4 0.4 0.5 0.3 0.5 0.5 0.6 0.9 0.9 0.9 0.9 0.8 0.4 0.4 0.4 0.4 0.3 0.4 0.4 0.4 0.4 0.4 0.2 0.2

FOXA_known6 FOXA_disc5 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 1.0 0.5 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.4 0.3 0.4 0.5 0.8 0.8 0.8 0.8 0.8 0.8 0.3 0.2 FOXA_known7 FOXA_disc4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.5 1.0 0.4 0.3 0.4 0.4 0.3 0.4 0.4 0.4 0.4 0.5 0.8 0.8 0.8 1.0 0.4 0.5 0.5 0.5 0.4 0.4 0.4 0.2 FOX_1 FOXD3_1 FOXO6_2 FOXK1_4 FOXB1_3 SRY_3 FOXA_disc3 FOXA_known6 FOXA_disc1 FOXA_known1 FOXA_disc2 FOXA_disc4 HNF4_disc4 FOXD3_3 FOXD2_1 FOXC2_1 FOXC1_4 ARID5A_1 HNF4_known1 HNF4_known6 HNF4_known13 HNF4_known2 RXRA_disc1 FOXD3_2 FOXC2_2 FOXC1_6 FOXK1_1 FOXL1_3 FOXJ3_1 FOXF2_2 FOXO1_2 FOXG1_5 FOXO1_3 FOXO4_4 FOXI1_3 FOXL1_4 FOXP3_2 FOXJ2_4 FOXC1_5 FOXQ1_2 FOXA_known3 FOXA_known2 FOXA_disc5 FOXJ1_1 FOXI1_2 FOXI1_1 FOXJ2_1 FOXB1_2 STAT_known8 FEV_1 SPI1_known2 STAT_disc5 TCF7L1_2 HNF4_known9 NR4A_known4 RARA_1 HNF4_known26 TCF12_known1 FOXJ2_5 FOXF2_1 FOXD1_1 FOXC2_3 FOXJ3_2 FOXO3_1 FOXO4_2 FOXO3_3 FOXD3_4 FOXD2_2 FOXC1_7 FOXO4_1 FOXC1_1 FOXF1_1 FOXQ1_1 FOXA_known5 FOXA_known7 FOXA_known4 NR2F2_2 NR4A_known1 NR3C1_disc5 FOXL1_5 FOXC1_3 FOXO3_2 FOXJ1_2 FOXD1_2 FOXO3_5 FOXB1_4 TCF12_disc2 FOXJ3_7 HDAC2_disc2 EP300_disc3 FOXO1_1 FOXL1_1 FOXA_disc3 1.0 0.5 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.3 0.3 0.4 0.4 0.4 0.4 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.6 0.6 0.5 0.5 0.5 0.5 0.5 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.4 0.4 0.5 0.5 0.3 0.4 0.4 0.3 0.3 0.2 0.2 0.2 0.2 0.3 0.5 0.4 0.4 0.3 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.5 1.0 0.8 FOXA_known6 0.5 1.0 1.0 1.0 0.9 0.9 0.8 0.8 0.8 0.7 0.4 0.3 0.7 0.6 0.7 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.7 0.7 0.7 0.9 0.8 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.9 0.9 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.8 0.9 0.9 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.4 0.2 (e) All FOXA_disc1 0.4 1.0 1.0 1.0 0.9 0.9 0.8 0.9 0.8 0.7 0.4 0.3 0.8 0.6 0.7 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.7 0.7 0.7 0.7 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.8 0.9 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.9 0.9 0.9 0.9 0.8 1.0 1.0 0.8 0.8 0.8 0.9 0.9 0.7 0.6 0.6 0.6 0.6 0.4 0.5 0.5 0.4 0.3 0.4 0.3 0.4 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.2 FOXA_known5 0.4 1.0 1.0 1.0 0.9 0.9 0.9 0.9 0.8 0.6 0.4 0.3 0.8 0.7 0.8 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9 0.9 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.9 0.9 0.9 0.9 0.9 1.0 1.0 0.8 0.8 0.7 0.9 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.5 0.4 0.4 0.3 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.3 0.3 Near FOXA_known7 0.4 0.9 0.9 0.9 1.0 0.8 0.8 0.8 0.8 0.7 0.4 0.3 0.6 0.8 0.7 0.8 0.8 0.8 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1.0 1.0 1.0 0.9 0.9 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.9 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.9 0.9 0.9 0.7 0.6 0.6 0.6 0.6 0.4 0.5 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.1 FOXA_known4 0.4 0.9 0.9 0.9 0.8 1.0 0.9 0.9 0.9 0.6 0.4 0.3 0.8 0.8 0.8 0.8 0.9 0.9 0.8 0.8 0.8 0.8 0.8 0.8 0.7 0.8 0.8 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.8 0.8 0.8 0.8 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.8 0.8 0.8 0.9 0.8 0.7 0.7 0.7 0.7 0.7 0.6 0.5 0.4 0.4 0.3 0.5 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 5.6 FOXA_known3 0.3 0.8 0.8 0.9 0.8 0.9 1.0 1.0 0.9 0.5 0.4 0.3 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.9 0.8 0.8 0.6 0.6 0.7 0.8 0.8 0.8 0.8 0.8 0.9 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.8 0.8 0.7 0.8 0.7 0.6 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.5 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.3 0.2 0.2 0.2 TSS Distal FOXA_known2 0.3 0.8 0.9 0.9 0.8 0.9 1.0 1.0 0.9 0.6 0.4 0.4 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.8 0.9 0.8 0.8 0.8 0.8 0.8 0.8 0.7 0.6 0.6 0.7 0.8 0.8 0.7 0.7 0.8 0.9 0.8 0.7 0.8 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.8 0.8 0.7 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.2 FOXA_known1 0.3 0.8 0.8 0.8 0.8 0.9 0.9 0.9 1.0 0.7 0.4 0.3 0.7 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.8 0.9 0.7 0.7 0.7 0.7 0.6 0.7 0.7 0.8 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.8 0.7 0.7 0.7 0.9 0.8 0.7 0.7 0.7 0.7 0.7 0.6 0.5 0.4 0.4 0.3 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.1

FOXA_disc2 0.3 0.7 0.7 0.6 0.7 0.6 0.5 0.6 0.7 1.0 0.4 0.4 0.5 0.3 0.5 0.5 0.6 0.5 0.6 0.6 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.4 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.6 0.5 0.5 0.5 0.5 0.5 0.6 0.7 0.6 0.7 0.6 0.7 0.7 0.6 0.7 0.7 0.7 0.7 0.6 0.7 0.7 0.6 0.6 0.6 0.7 0.6 0.5 0.5 0.6 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.8 0.4 0.4 0.4 0.4 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 FOXA_disc5 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 1.0 0.5 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.4 0.4 0.3 0.4 0.4 0.4 0.5 0.5 0.4 0.4 0.5 0.5 0.4 0.5 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.4 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.5 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.4 0.3 0.4 0.5 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.3 0.2 FOXA_disc4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.5 1.0 0.4 0.3 0.4 0.4 0.3 0.3 0.4 0.4 0.3 0.3 0.3 0.3 0.4 0.3 0.3 0.4 0.4 0.3 0.3 0.2 0.3 0.3 0.2 0.2 0.3 0.4 0.4 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.3 0.4 0.2 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.8 0.8 0.8 1.0 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 0.2

FOXA1_HepG2_encode-Myers_seq_hsa_v041610.1-C-20 1.0 14 12 9.0 6.4 4.8 4.3 4.6 4.0 5.7 1.6 1.1 2.4 3.3 2.2 3.9 3.2 3.7 1.8 2.0 3.7 3.6 3.4 2.9 1.7 2.9 2.8 1.9 1.1 1.8 5.5 4.8 4.9 6.0 7.0 10 6.5 8.7 2.3 4.7 4.4 3.6 3.5 8.8 7.4 4.7 7.1 7.5 11 7.8 8.4 6.8 3.7 2.3 2.8 8.1 7.4 8.5 5.9 4.4 3.3 5.5 5.2 1.5 1.3 1.1 2.6 4.6 5.7 3.6 4.4 4.4 3.6 1.9 2.7 2.8 2.2 2.9 1.0 1.3 FOXA1_HepG2_encode-Myers_seq_hsa_v041610.1-SC-101058 1.1 14 12 9.6 6.6 4.9 4.5 4.7 4.2 5.6 1.7 1.1 2.5 3.3 2.2 4.1 3.2 3.9 1.8 2.1 3.8 3.7 3.6 3.1 1.7 2.9 2.9 1.9 1.2 1.8 5.8 5.0 5.2 6.0 7.3 10 6.8 8.7 2.4 4.8 4.6 3.7 3.7 9.4 7.7 4.6 7.4 7.9 11 7.9 8.8 7.0 3.7 2.3 2.9 8.3 7.6 8.8 6.1 4.5 3.3 5.7 5.2 1.5 1.3 1.1 2.9 4.9 5.7 3.7 4.6 4.5 3.7 1.9 2.6 2.8 2.5 2.8 1.0 1.2 FOXA2_HepG2_encode-Myers_seq_hsa_v041610.1-SC-6554 1.1 16 13 11 7.2 5.6 5.1 5.6 4.7 6.6 1.7 1.1 2.9 3.6 2.3 4.5 3.5 4.2 1.9 2.1 4.1 3.9 3.7 3.3 1.6 2.8 3.0 2.0 1.2 1.9 6.3 5.3 5.5 6.6 7.6 11 7.5 9.3 2.5 5.3 4.8 3.8 3.6 9.7 7.8 5.0 7.5 8.4 13 8.4 9.1 7.9 3.9 2.3 3.0 9.0 8.3 9.8 7.1 5.4 4.0 6.9 6.2 1.7 1.1 1.1 2.9 4.8 5.8 3.8 4.7 4.7 3.8 1.9 2.9 3.0 2.4 3.0 1.0 1.3 FOXA1_T-47D_encode-Myers_seq_hsa_DMSO-0.02pct-v041610.2-C-20 1.0 14 12 10 7.0 5.2 4.3 4.4 4.0 5.5 0.9 1.4 2.1 3.4 2.2 4.6 3.4 3.5 1.9 2.2 3.6 3.9 3.7 3.2 1.8 3.2 2.7 1.8 1.1 2.0 6.0 5.1 5.3 6.0 7.2 9.5 6.7 9.3 2.7 5.1 5.2 4.0 4.2 10 8.9 4.5 8.6 8.2 11 9.6 10 6.9 3.8 2.5 3.1 8.3 7.2 8.7 5.8 4.2 3.3 5.3 5.0 1.5 1.2 1.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.2 FOXA1_ECC-1_encode-Myers_seq_hsa_DMSO-0.02pct-v041610.2-C-20 1.8 20 20 13 9.3 6.6 4.6 5.1 4.6 2.5 1.0 1.5 1.9 2.8 2.1 4.8 3.0 4.5 1.4 1.7 3.9 3.9 3.5 3.1 1.3 2.6 2.8 1.5 1.2 1.6 7.6 6.0 7.0 7.0 9.6 15 8.4 12 2.9 6.0 6.8 5.1 4.3 11 13 3.3 13 9.1 15 16 14 8.2 4.0 2.8 3.6 12 9.1 9.9 3.1 1.8 1.3 2.1 2.0 1.0 1.1 1.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.5 1.4

(d) FOXA1_HepG2_encode-Myers_seq_hsa_v041610.1-C-20 1.0 14 12 9.0 6.4 4.8 4.3 4.6 4.0 5.7 1.6 1.1 2.4 3.3 2.2 3.9 3.2 4.4 3.3 5.5 5.2 1.5 1.3 1.1 2.6 4.6 2.7 2.8 2.2 2.9 1.0 1.3

FOXA1_HepG2_encode-Myers_seq_hsa_v041610.1-SC-101058 1.1 14 12 9.6 6.6 4.9 4.5 4.7 4.2 5.6 1.7 1.1 2.5 3.3 2.2 4.1 3.2 4.5 3.3 5.7 5.2 1.5 1.3 1.1 2.9 4.9 2.6 2.8 2.5 2.8 1.0 1.2

FOXA2_HepG2_encode-Myers_seq_hsa_v041610.1-SC-6554 1.1 16 13 11 7.2 5.6 5.1 5.6 4.7 6.6 1.7 1.1 2.9 3.6 2.3 4.5 3.5 5.4 4.0 6.9 6.2 1.7 1.1 1.1 2.9 4.8 2.9 3.0 2.4 3.0 1.0 1.3 2013 Research, Acids Nucleic FOXA1_T-47D_encode-Myers_seq_hsa_DMSO-0.02pct-v041610.2-C-20 1.0 14 12 10 7.0 5.2 4.3 4.4 4.0 5.5 0.9 1.4 2.1 3.4 2.2 4.6 3.4 4.2 3.3 5.3 5.0 1.5 1.2 1.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.2 FOXA1_ECC-1_encode-Myers_seq_hsa_DMSO-0.02pct-v041610.2-C-20 1.8 20 20 13 9.3 6.6 4.6 5.1 4.6 2.5 1.0 1.5 1.9 2.8 2.1 4.8 3.0 1.8 1.3 2.1 2.0 1.0 1.1 1.9 1.0 1.0 1.0 1.0 1.0 1.0 1.5 1.4

Figure 1. Pipeline output for FOXA factor group, an example that highlights different aspects of the resource (in the interest of space, only selected columns are shown in the enlargements). (a) The known and discovered motifs for the factor group, drawn with WebLogo (25). Because the original orientation is arbitrary, motifs are flipped as necessary to increase the similarity of the displayed orientations. (b and c) Similarity between the motifs for this factor group and the motifs: (b) for this factor group, (c) for all other motifs that match at least one motif for this factor group with similarity  0:75. Motif similarities are shown in black/white scale with gray starting at 0.65. (d) Enrichments of motifs in the data sets for this factor group. Data sets are named indicating the factor, cell type/stage, lab code, followed by values to differentiate data sets and make them unique. Red is used for enrichment, blue is used for depletion (which is rare in this study). Enrichments are not shown for motifs when control motifs could not be generated or there was not sufficient information content to match at a 4À8 P-value. (e) Magnified enrichment heatmap value. The three triangles represent the background regions used for enrichment; the top triangle is all regions, the left triangle is those within 2 kb of an annotated TSS and the right triangle is those outside the 2 kb window (the regions used in the left and right triangles partition that of the top triangle). The number shown is the enrichment in the inclusive background. Here we see an apparent contradiction: a higher enrichment for the union than the parts. This occurs because the higher counts permit a smaller confidence interval around the ratios used to compute the enrichment. Heatmaps are ordered using hierarchical clustering followed by optimal leaf ordering (26). The Supplementary Results include an analysis of the robustness of the motif similarity

metric and the enrichment values. 3

Downloaded from from Downloaded http://nar.oxfordjournals.org/ at MIT Libraries on December 14, 2013 14, December on Libraries MIT at 4 Nucleic Acids Research, 2013 Downloaded from

Figure 2. Outline of motif discovery pipeline. Input regions for each data set are randomly partitioned into two groups. The top 250 regions of one of the partitions are scanned for motifs using five de novo motif discovery tools. These motifs are evaluated using the peaks from the other partitioned and pooled across data sets for a factor group to produce the final list of discovered motifs for each factor group. http://nar.oxfordjournals.org/

be relatively comprehensive and had difficulty finding Shared motifs suggest interacting relationships matches to novel motifs outside it. An exception is We find that most factors have motifs for other factors ZNF263_disc1, which does not match a motif in our enriched in their binding sites (summarized in database, but does roughly match the specificity for Supplementary Table S4). This may occur due to (i) co- ZNF263 indicated in (30) despite only having weak en- operative binding of the two factors to the same locations; richment (1.8-fold). at MIT Libraries on December 14, 2013 (ii) interfering binding between factors where one binds Although the motifs that match each other (either known or discovered) generally have similar enrichments, near the other to prevent binding; (iii) some similarity in in some cases we find substantially higher enrichment for motif specificity; (iv) the two factors functioning on a some motif variants over others (Figure 4 and similar set of (e.g. ones specific to one tissue), Supplementary Table S3). For example, NFE2_disc1 without directly interacting; or (v) the factors binding to matches the known NFE2 motif, but has a 76-fold similar genomic regions (e.g. near genes). Our analysis maximal enrichment across NFE2 data sets, compared does not directly rule out any of these possibilities; with 56-fold enrichment for the most enriched known however, (iii) is generally verifiable using our motif simi- NFE2 motif. Different known motifs for the same factor larity metrics and (v) can be examined by inspecting only often show a broad range in enrichment: has six the TSS-proximal enrichment. motifs described in Transfac, with an enrichment differen- The motif most enriched in multiple data sets was the tial of as much as 4-fold consistently across data sets. This TPA DNA response element (TRE; TGA[C/G]TCA), enrichment analysis provides a systematic way to choose which is recognized by the AP1 TF when it is formed by among variants of a motif. FOS/JUN dimers (31) and other factors including MAF We also saw varying enrichment of the known motif, and NFE2. The enrichment of the TRE in a data set is depending on the specific data set for a factor group. For often stronger than that of even the known in vitro example, CTCF_known2 is enriched in CTCF data sets in sequence specificity and may arise from a number of phe- a range from 30- to 78-fold on identically processed data. nomena, including (i) a cooperatively interaction with This may be a result of varying quality of the samples AP1, (ii) competition with AP1 for the same binding across data sets or may be a consequence of true biological sites, leading to a potentially repressive role for the TF differences. or (iii) reuse of binding sites due to, for example, accessi- Identifying the sequence specificity for factors that were bility of chromatin. We find a motif matching the TRE previously uncharacterized is of particular interest. In all, motif for 20 factor groups (AP1_disc3, AP2_disc1, 17 factor groups had no known motif but now have dis- BATF_disc1, BCL_disc2, CTCF_disc8-9, EP300_disc1, covered enriched motifs (BCL, BDP1, CCNT2, CHD2, GATA_disc2, HMGN3_disc1, IRF_disc2, MAF_disc1, CTCFL, HDAC2, HMGN3, RAD21, SETDB1, SIRT6, MEF2_disc3, MYC_disc3, NFE2_disc1, NR3C1_disc2, SMARC, SMC3, SP2, SIN3A, THAP1, TRIM28 and PRDM1_disc2, RXRA_disc3, SMARC_disc1, STAT_ ZNF263). These discovered motifs may represent the disc2, TCF7L2_disc1 and TRIM28_disc1). direct or indirect (e.g. through cofactors) DNA binding We found that the enrichment of the TRE to be par- specificity. ticularly notable for a few factors. GATA and AP1 have Nucleic Acids Research, 2013 5

(a) Downloaded from http://nar.oxfordjournals.org/ (b) at MIT Libraries on December 14, 2013

Figure 4. Comparison of known versus discovered motifs (selected where discovered better enriched than known; all factor groups with a discovered motif matching a known motif in Supplementary Table S3). Displayed is the known and discovered motif with the maximum enrichment across all data sets for a factor group. Only the discovered motifs that match a known motif for a factor group are considered. The maximum enrichment is indicated for each factor and, in paren- Figure 3. (a) Summary of input data used. The outside ring indicates thesis, the ‘raw’ enrichment for the same data set without the use of the the experimental data sets (one tick for each of 427), which are shuffle motifs for correction. separated into 123 transcription factors (second ring). The TFs are further grouped into 84 factor groups (third ring). We are able to find a matching discovered motif for 41 of the 56 factor groups with disc1, which matches the TRE, is more enriched than a known motif; 29 of these 41 factor groups have additional discovered the known TCF7L2 motif (TCF7L2_disc2) in only the motifs that may be associated with cofactors. For all but 1 of the 15 factor groups where the known motif is not recovered we still find TCF7L2 colorectal cancer cell line HCF-116 data set, con- enriched discovered motifs. We also discovered enriched motifs for 17 sistent with the known interaction of JUN and TCF7L2 of the 28 factor groups without a known motif. (b) Recovery of known during intestinal cancer development (35). motifs by each of the discovery tools. Performance of discovery in AP1 also binds to the cAMP response element (CRE; T terms of number of factor groups for which the known motif was re- covered. A motif is considered a match if it matches any of the known GACGTCA) when the dimer is formed by ATF3/JUN motifs for a factor group (see Supplementary Methods for details on (31) and this is the motif we find as AP1_disc1. how matches are computed). The number of additional factors that However, AP1_disc3 (which matches the TRE) is the have a match is shown with each additional motif (only three motifs most enriched motif in FOS data sets. Interestingly, are taken from each individual method, whereas we have up to 10 for ATF3_disc1 is not the CRE, but rather the E-box (see the pipeline). The number of factor groups with no motif match is shown in parenthesis. When multiple data sets exist for a factor later in text). We do, however, find a variant of the group, the fraction that matches is used in computing its contribution CRE (with additional specificity) as ATF3_disc2. The for computing the performance of the individual tools. most enriched discovered motif for , E2F_disc1 also matches the CRE and is highly enriched in all data sets. known cooperative binding (32). TFs in the SMARC is a critical regulator, which recognizes the E-box factor group are members of the SWI/SNF chromatin sequence. To aid in comparisons, we include MAX, which remodeling complex (33), which is necessary for proper forms complexes with MYC, and USF1/2, which also rec- regulation by FOS/JUN dimers (34); and TCF7L2_ ognizes the E-box sequence, in the MYC factor group. 6 Nucleic Acids Research, 2013

We find multiple motifs enriched in MYC binding sites, known NFY specificity (CCAAT). These discovered highlighting the multifunctional role MYC and the other motifs are consistent with several known interactions of E-box recognizing play. We found a version of NFY. RFX5 promotes the cooperative binding between the E-box with additional specificity (MYC_disc1) that RFX and NFY (42), CEBPB and NFY interact in at least was highly enriched in USF1/2 bound regions ( 98- one promoter (43) and SP1 and NFY are known to fold for USF2 versus <9-fold enrichment for MYC/ interact (44). E2F_disc4 has particularly high enrichment MAX). This motif was more enriched than the known in data sets, consistent with the cooperative role E-box motifs, including known USF motifs, in many E2F4 and NFY play in cell cycle regulation (45). USF data sets. We find a second, less specific E-box STAT factors are involved in regulating number of motif (MYC_disc2), which shows more even enrichment growth-related functions. We analyze STAT1, STAT2 across factors. We also find discovered motifs of other and STAT3 here in the context of GM12878, HeLa-S3, factors matching the E-box, including SIN3A_disc2 (dis- MCF10A-Er-Src and K562 cells. We find relatively con- cussed later in text), NFE2_disc2-3 and SIRT6_disc1. It is sistent enrichment of the STAT full site (TTCCNGGAA), notable that although SIRT6 is a chromatin-associated which STAT_disc1 matches, while finding weak enrich- without a known DNA binding domain (36), the ment for just the half-site (TTCC). We also find motifs only discovered motif matches the E-box (with 16-fold involved in other proliferative functions including enrichment in SIRT6 bound regions), suggesting that STAT_disc2, which is particularly enriched in STAT3 Downloaded from MYC or another E-box recognizing factor may play an data sets and matches the TRE, consistent with STAT3 important, but indirect, chromatin-related role. being one of the many interaction partners for AP1 (46). Motif enrichment is able to identify both positive and STAT_disc3 matches the IRF consensus and has enrich- negative interactions for the same factor. For example, ment that is particularly high in STAT1 and STAT2 data SIN3A, a corepressor known to interact with a number sets stimulated by IFNa, highlighting the cooperativity of of proteins, has discovered motifs matching REST STAT factors and IRF in immune functions. STAT_disc4 http://nar.oxfordjournals.org/ (SIN3A_disc1 and more weakly disc3–4) and MYC is a match to the CEBPB motif and is found enriched in (SIN3A_disc2). These are consistent with SIN3A’s STAT3 data sets, consistent with the known cooperative known involvement in repression by REST (37) and role for these two factors (47). SIN3A being a known antagonist for MYC (38). TFs with ETS domains are highly conserved and Morever, MYC_disc4 matches RFX5 and is enriched involved in several cellular processes [reviewed in (48)]. particularly for MAX-bound regions in H1-hESC and A number of TFs have discovered motifs that match the GM12878, and MYC_disc5 matches the CEBPB known ETS consensus, including EGR1_disc2, GATA_disc3, motif and is enriched in MYC regions bound in unstimu- MEF2_disc2, NRF1_disc2, NR2C2_disc1 and at MIT Libraries on December 14, 2013 lated K562 cells. MXI1, which was not included in the PAX5_disc4. These discovered motifs are supported by MYC factor group although it does interact with MAX known interactions between GATA and ETS in sea to bind to MYC-MAX sites (39), has MXI1_disc1 that squirts (49), MEF2 and the ETS factor PEA3 (50) and matches RFX5 in both the K562 and HeLa-S3 cell lines. NR2C2 with the ETS factor ELK4 (51). Moreover, We analyzed six IRF family data sets: IRF1 binding in PAX5 and ETS factors have shared roles in the develop- K562 cells stimulated by IFNa (viral innate response) or ment of B-cells (52,53). Looking at the discovered ETS IFNg (viral, bacterial and tumor control); IRF3 in motifs, we find that ETS_disc8 matches the known motif HepG2, GM12878 and HeLa-S3; and IRF4 in for MYB and the two have been known to cooperate, a GM12878. The most strongly enriched motif relationship that is important in the context of certain (IRF_disc1, matching NFY) is highly enriched (>20- cancers (54). fold) for all three IRF3 data sets and IRF1 in K562 THAP1 has two discovered motifs, both of which under IFNg stimulation. This suggests that binding of match the known YY1 motif (the first with additional IRF to NFY sites occurs only under specific conditions specificity added by an apparent HNF4 motif). To our and by only some IRF members and potentially expands knowledge, the relationship between THAP1 and YY1 on the previously documented interaction of NFY and has not been directly observed; however, THAP1 has IRF2 at a single promoter (40). IRF_disc4, which been known to associate with the coactivator HCF-1 matches SP1, is enriched in the same cell types, albeit at (55), and YY1 and HCF-1 are known to interact (56). much lower levels. IRF_disc3, which matches the known Our result suggests that THAP1 and YY1, possibly with IRF consensus, shows weak-to-no enrichment in these the addition of HNF4, may interact at least in the K562 data sets, but shows an enrichment of 8.8-fold for IRF1 cell line for which we have THAP1 binding data. bound regions in K562 cells under IFNa stimulation and RAD21_disc3 also matches YY1, suggesting an additional 3.1-fold enrichment for IRF4 bound regions in GM12878. interaction. IRF_disc2, which matches the TRE, is enriched primarily NANOG, an important pluripotency TF, has a known in GM12878 regions bound by IRF4. The known SPI1 motif that is only weakly enriched (1.3-fold) in the bound motif matches IRF_disc5, and reciprocally SPI1_disc2 regions and not discovered by our pipeline. We see much matches the IRF motif, consistent with the importance stronger enrichment for the known POU5F1 and of SPI1 in hematopoietic development (41). POU2F2 motifs, for which we also find similar motifs Beyond the discovered motif for IRF, several other dis- (NANOG_disc2 and NANOG_disc4, respectively), con- covered motifs (AP1_disc2, CEBP_disc2, E2F_disc4, sistent with their shared roles in pluripotency (57,58). PBX3_disc1, RFX5_disc2 and SP1_disc1-2) match the The interaction of these factors is further supported by Nucleic Acids Research, 2013 7

POU5F1_disc2 matching the known POU2F2 motif. identified by examining enrichments in cell lines-specific Additionally, NANOG_disc2 and disc3 match the data sets. known motifs for TCF7L2 and TCF12, respectively, As a transcriptional coactivator, EP300 interacts with again consistent with the important role TCF proteins numerous TFs [reviewed in (68)] and has been shown to play in stem cells (59). have binding that can identify tissue-specific enhancers CTCF plays a variety of vital roles in the organization (69). Conversely, FOXA has a DNA binding domain of chromatin architecture (60) and the motifs we discover and plays an important role in liver development and matching the known CTCF specificity (RAD21_disc1, function (70) and is a pioneer factor responsible for SMC3_disc1,2-4, CTCFL_disc1,10, ZBTB7A_disc1,2, priming chromatin for the binding of other factors SP2_disc3 and RXRA_disc2,5; some weakly) are largely (reviewed in (71)]. Other proteins involved in chromatin compatible with this role. RAD21 is a highly conserved restructuring include HDAC2, which transcriptionally protein involved in DNA double-strand repair (61) known represses through histone deacetylation (72) and to co-localize with CTCF (62). Cohesin, of which SMC3 is HMGN3 (73). Further, two factor groups are directly a subunit, is brought to the chromatin by CTCF (63). involved in transcription including three RNA Pol3 Further, although the function of the CTCF paralog subunits (BDP1, RPC155 and TFIIIC-110) and CCNT2, CTCFL is not completely known, it does appear to be which is involved in the elongation of Pol2 (74). involved in imprinting through interaction with a Eight of these ten factor groups have at least one data Downloaded from histone methyltransferase (64). set in K562 (erythroleukemia cells), and for four of these we discover motifs that match the GATA consensus, Combinations of motifs which is then enriched specifically in the K562 data sets (BCL_disc5, CCNT2_disc1, HDAC2_disc1 and A few of the discovered motifs contain additional specifi- HMGN3_disc2). GATA has a known important role in city or have distinct segments matching multiple motifs. K562 (75), and we also have previously found an associ- http://nar.oxfordjournals.org/ For example, EGR1_disc4 appears to be a combination of ation with GATA motifs and chromatin state-derived en- multiple motifs (EGR1, IKZF1 and a motif), hancers for K562 cells (76). We also find three additional and SETDB1_disc1 contains the ZNF143 core sequence motifs that have enrichment specific to the factor group’s with significant additional specificity. The appearance of K562 data set: BDP1_disc1, a 23-nt motif that contains these motifs suggests highly specific ‘grammars’ for these the STAT consensus; HMGN3_disc1, which matches the motifs that may require specific spacing and orientation of TRE; and TRIM28_disc2, which matches no known motif binding sites for functionality. and may be associated with an uncharacterized regulator

We find several additional enrichments of potential active in this cell line. at MIT Libraries on December 14, 2013 interest. PBX3_disc2 matches the known MEIS1 motif, Likewise, for GM12878, an EBV-mediated consistent with the known cooperative binding of lymphoblastoid cell line, we find three discovered motifs MEIS1 and PBX (65). TAL1_disc1 matches GATA, (BCL_disc4, EP300_disc5 and TCF12_disc4) that match with the potential connection that GATA and TAL1 are the known IRF consensus. IRF4 has been shown to be known to be important in hematopoesis and vascular de- important in the establishment of these cell lines (77), and velopment (66,67). HSF_disc1 matches the known CEBP the family is an important player in immune cells (78). motif and has much higher enrichment in HSF data sets This enrichment is also consistent with our previous (31-fold) compared with the known motifs for HSF (<9- study using epigenetic marks (76), where we found IRF fold). Additionally, EGR1_disc5, HNF4_disc5, to be the strongest enriched motif in GM12878-specific NRF1_disc3, PAX5_disc2, RXRA_disc4/PAX5_disc3 enhancers. We also find GM12878-specific enrichment and SREBP_disc1 match the known motifs for ZIC, for motifs matching NFKB (BCL_disc6) and POU2F2 SOX, SP1, PAX2/PAX3, IRF and RFX5, respectively, (TATA_disc9), consistent with the known biology of suggesting additional previously uncharacterized inter- these factors (79,80). actions. Lastly, we find some motifs that show more am- The motifs we find specifically enriched in HepG2 (liver biguous matches: SMARC_disc2 shows weak similarity to carcinoma) data sets match the known motifs for FOXA homeobox TGTAGT motif, NR2C2_disc2–3 weakly (EP300_disc3, HDAC2_disc2, and TCF12_disc2), HNF4 matches the known HNF4 motif and EGR1_disc3/ (FOXA_disc5 and HDAC2_disc5) and CEBP SETDB1_disc2 matches the repetitive NRF1 motif. (EP300_disc2,6), three key liver regulators (70,81). We find motifs with enrichments specific to H1-hESC, which General factors enriched in cell line-specific key regulators include matches to the pluripotency factor POU2F2 (TATA_disc9), the near universally expressed repressor Factors directly responsible for the establishment of en- REST (BCL_disc3 and HDAC2_disc4) and key metabolic hancers, chromatin restructuring or polymerase recruit- regulator NRF1 (HDAC2_disc4). We find additional cell ment frequently exhibit binding that is highly cell type line-specific enrichments for FOXA_disc3 (TCF12) in specific. Because most of these factors do not have their ECC-1, FOXA_disc4 (STAT) in both T-47D and ECC-1 own sequence specificity, their binding is often correlated and EP300_disc2,6 (CEBP) and EP300_disc4 (ETS) with with that of regulators important for the specific cell line. enrichment in the HeLa-S3 data set. We analyze several such factors (BCL, BDP1, CCNT2, Even for these factors, we find motifs that are consist- EP300, FOXA, HDAC2, HMGN3, TATA, TCF12 and ently enriched across assayed cell lines for a given factor. TRIM28) and find that key cell line regulators can be FOXA_disc1, for example, matches the known FOXA 8 Nucleic Acids Research, 2013 motif, indicating that FOXA’s own motif also plays an (TATA_disc5,7), Novel7 (TRIM28_disc2) and Novel8 important role in its specificity. Most of the motifs we (E2F_disc6). identify for RNA Pol2 machinery (TAF1, GTF2B, Novel1 (using ZBTB33_disc1) is highly enriched in at GTF2F1 and TBP) are enriched in all cell lines, including least one data set for each of the factor groups for which the known TATAAA motif (TATA_known2). Also, it is found (BRCA1, CHD2, ETS, NR3C1 and ZBTB33). TATA_disc1, disc6 and disc8 have consistent enrichment All five factor groups except CHD2 have at least one and match the known motifs for YY1 (which is known to known motif, and for each of these data sets Novel1 is be important in establishing transcription) (82), NFY and more enriched in at least one data set than any known ETS. The top discovered motif BCL_disc1 matches the motif [the result for NR3C1 is questionable because only known ETS motif and is also enriched across data sets. one data set has enrichment and that data set has been Interestingly, we find that the TRE motif is found and independently flagged as problematic; see http://www. enriched in a cell line-specific manner for several factors, encodeproject.org/encode/qualityMetrics.html]. The but for different cell lines. For example, HMGN3_disc1 is shared role of BRCA1 and CHD2 in DNA damage enriched in K562, BCL_disc2 has the highest enrichment repair (83,84) suggests that Novel1 may be involved in in GM12878, TRIM28_disc1 is only enriched in the this or other shared roles for these factors and highlights HEK2932 and U2OS cell lines and EP300_disc7 has en- the utility in shared motif enrichment even outside of motifs richment in the neuroblastoma cell line SK-N-SH-RA and directly tied to a factor. Downloaded from HeLa-S3. This suggests that perhaps AP1 or other factors Similarly, for SIX5, we see only weak enrichment of the recognizing TRE are selectively interacting with these known SIX5 motif and fail to discover a motif similar to proteins depending on the cell line. it. However, Novel2 (using SIX5_disc1) shows over 100- fold enrichment for all three data sets (K562, GM12878 and H1-hESC). Novel2 also shows high enrichment in Novel motifs raise possibility of unknown regulators http://nar.oxfordjournals.org/ data sets for which it was not rediscovered, including Although we are able to putatively explain the majority ATF3 (all data sets have >20-fold enrichment with of the motifs we discover as either matches to previously GM12878 having 106-fold) and NRF1 (all data sets known motifs or low complexity sequences, we do identify have >30-fold enrichment). Moreover, the known 30 putative novel motifs (Figure 5). We placed ZNF143 motif, which is 4-fold enriched in the one these into eight groups on the basis of their similarity: ZNF143 data set, is also not recovered, but Novel2 is Novel1 (BRCA1_disc1, CHD2_disc1, ETS_disc3,6, 24-fold enriched. The breath of data sets sharing this NR3C1_disc3 and ZBTB33_disc1-4), Novel2 (EGR1_ motif suggests it may be recognized by an important yet disc4, ETS_disc1,5,7, SETDB1_disc1, SIX5_disc1-3, unknown or under-characterized regulator. at MIT Libraries on December 14, 2013 SMARC_disc2 and ZNF143_disc1-3), Novel3 Like the known ZBTB7A motif, Novel3 (using (SP2_disc3, TCF12_disc3 and ZBTB7A_disc2), Novel4 SP2_disc3) is largely poly-G, which causes us to underesti- (RFX5_disc3), Novel5 (BDP1_disc2), Novel6 mate its enrichment due to our shuffling process. Despite

Figure 5. Putative novel motifs. We find eight motifs that are not represented in the literature motifs we collected, three of which are found for at least two factor groups. These patterns may represent the binding specificity of the factors for which they are discovered or for other factors that cooperate with them. Nucleic Acids Research, 2013 9 this, however, it does show enrichment in several data sets, responsible for binding, these motifs enable analyzes including for the factor groups for which it was identified. involving population data (88) and for interpreting This motif shows similarity to other poly-G motifs, such GWAS data (89). Two other ENCODE articles also as known SP1 motifs, but appears to be distinct due to its perform motif discovery: (90) produce a non-redundant other bases. list of discovered motifs but do not perform an extensive Novel4 (RFX5_disc3) shows moderate, but consistent analysis of the relationships between factors and (91) use (2- to 6-fold) enrichment across the RFX5 data sets. The DNaseI footprinting data to identify relevant motifs. consensus is composed of two of the same components as Having a motif catalog is also the first step in identify- the known motifs (AAC and TGA), but ordered differ- ing high-quality computational targets of factors, which ently. Consequently, it may represent the binding specifi- may allow the identification of binding sites that were, for city of, for example, an alternative isoform of RFX5. The example, not found in the conditions assayed. Two remaining motifs (Novel5-8), were found for factors that popular strategies are used for this purpose. One is using show cell line-specific enrichments. Consequently, these clustering of motif instances for factors known to cooper- may represent specificities for regulators that are previ- ate to form cis-regulatory modules (92,93). This resource ously unidentified. is well suited for this purpose because it naturally provides sets of motifs that are likely to cooperate.

Experimental and evolutionary validation of novel motifs A second approach is the use of conservation on many Downloaded from closely related species (85,94–97). This can be performed Following the motif discovery and selection of these readily on these motif instances because a dense tree of putative novel motifs, a study released hundreds of new mammalian species has been sequenced readily permitting motifs generated using high-throughput SELEX (16). Two their alignment and measuring selection of a near-nucleo- of the putative novel motifs described in this section match tide level. Because changes in the underlying motif motifs generated by (16): Novel1 matches the motif http://nar.oxfordjournals.org/ matches are largely responsible for changes in binding for ETV6 and Novel6 matches ZBED1. Although we across species (98), evolutionary-based approaches on have incorporated these SELEX motifs into our the motif instances may be a means to deal with the resource, we continue to include Novel1 and Novel6 as high rate of non-functional binding (99–101). putative novel motifs because they were identified without knowledge of these new specificities and thereby strengthen the evidence for the remaining novel motifs. AVAILABILITY Four of these putatively novel motif groups (Novel1–3, 6) match motifs that were previously identified using A web interface, along with data files and accompanying conservation signals across four mammals (85) software, is available at http://compbio.mit.edu/encode- at MIT Libraries on December 14, 2013 (Supplementary Table S5). Therefore, this study motifs. provides additional support for these conservation-based motifs and, conversely, the motifs identified here gain SUPPLEMENTARY DATA comparative evidence. The relatively few distinct novel patterns that are found in this study and the comparative Supplementary Data are available at NAR Online, support for many of the few that are found suggests that including [102–110]. there may be a limited number of human TF motifs with many instances and which interact with one of the assayed factors that remain unknown. ACKNOWLEDGEMENTS The authors thank Ewan Birney, Christopher Bristow, Luke Ward, Jason Ernst, Anshul Kundaje, Gerald Quon DISCUSSION and other members of the Kellis Laboratory for helpful In this article, we provide a systematic and comprehensive discussions. collection of motifs for hundreds of human TF binding data sets. TF binding can be complex, with a factor FUNDING recognizing several or motifs or binding in the apparent absence of any motif [reviewed in (86)]. We also show that National Institutes of Health (NIH) [HG004037, it is possible to identify cofactors that may be partially HG007000 and HG006991]. Funding for open access responsible for binding or function. charge: NIH [HG004037, HG007000 and HG006991]. This motif resource has already been used in several articles while this article was in preparation, demon- Conflict of interest statement. None declared. strating its value for high-throughput analyses. Our motifs are being matched at low stringency to identify REFERENCES peaks that are void of any motif to understand the mech- anism through which motif-less peaks are generated (8). 1. Solomon,M.J., Larsen,P.L. and Varshavsky,A. (1988) Mapping The collection of known motifs and enrichment tech- proteinDNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed . Cell, 53, niques we present here was also used as a secondary 937–947. validation of peaks (87). Because having the motifs 2. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., allows for more precisely determining the bases Simon,I., Zeitlinger,J., Schreiber,J., Hannett,N., Kanin,E. et al. 10 Nucleic Acids Research, 2013

(2000) Genome-wide location and function of DNA binding 21. Ettwiller,L., Paten,B., Ramialison,M., Birney,E. and Wittbrodt,J. proteins. Science, 290, 2306–2309. (2007) Trawler: de novo regulatory motif discovery pipeline for 3. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and chromatin immunoprecipitation. Nat. Methods, 4, 563–565. Brown,P.O. (2001) Genomic binding sites of the yeast cell-cycle 22. Che,D., Jensen,S., Cai,L. and Liu,J.S. (2005) BEST: binding-site transcription factors SBF and MBF. Nature, 409, 533–538. estimation suite of tools. Bioinformatics, 21, 2909–2911. 4. Robertson,G., Hirst,M., Bainbridge,M., Bilenky,M., Zhao,Y., 23. Romer,K.A., Kayombya,G. and Fraenkel,E. (2007) WebMOTIFS: Zeng,T., Euskirchen,G., Bernier,B., Varhol,R., Delaney,A. et al. automated discovery, filtering and scoring of DNA sequence (2007) Genome-wide profiles of STAT1 DNA association using motifs using multiple programs and Bayesian approaches. Nucleic chromatin immunoprecipitation and massively parallel sequencing. Acids Res., 35, W217–W220. Nat. Methods, 4, 651–657. 24. Sun,H., Yuan,Y., Wu,Y., Liu,H., Liu,J.S. and Xie,H. (2010) 5. Qi,Y., Rolfe,A., MacIsaac,K.D., Gerber,G.K., Pokholok,D., Tmod: toolbox of motif discovery. Bioinformatics, 26, 405–407. Zeitlinger,J., Danford,T., Dowell,R.D., Fraenkel,E., Jaakkola,T.S. 25. Crooks,G.E., Hon,G., Chandonia,J. and Brenner,S.E. (2004) et al. (2006) High-resolution computational models of genome WebLogo: a sequence logo generator. Genome Res., 14, binding events. Nat. Biotechnol., 24, 963–970. 1188–1190. 6. Guo,Y., Papachristoudis,G., Altshuler,R.C., Gerber,G.K., 26. Bar-Joseph,Z., Gifford,D.K. and Jaakkola,T.S. (2001) Fast Jaakkola,T.S., Gifford,D.K. and Mahony,S. (2010) Discovering optimal leaf ordering for hierarchical clustering. Bioinformatics, homotypic binding events at high spatial resolution. 17, S22–S29. Bioinformatics, 26, 3028–3034. 27. Bairoch,A. (2004) The Universal Protein Resource (UniProt). 7. Li,X.Y., MacArthur,S., Bourgon,R., Nix,D., Pollard,D.A., Nucleic Acids Res., 33, D154–D159. Iyer,V.N., Hechmer,A., Simirenko,L., Stapleton,M., Hendriks,C.L. 28. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: Downloaded from et al. (2008) Transcription factors bind thousands of active and NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140. inactive regions in the Drosophila blastoderm. PLoS Biol., 6, e27. 29. Maglott,D., Ostell,J., Pruitt,K.D. and Tatusova,T. (2007) Entrez 8. The ENCODE Project Consortium. (2012) An integrated gene: gene-centered information at NCBI. Nucleic Acids Res., 35, encyclopedia of DNA elements in the . Nature, D26–D31. 489, 57–74. 30. Frietze,S., Lan,X., Jin,V.X. and Farnham,P.J. (2010) Genomic 9. Moorman,C., Sun,L.V., Wang,J., deWit,E., Talhout,W., targets of the KRAB and SCAN domain-containing zinc finger protein 263. J. Biol. Chem., 285, 1393–1403. Ward,L.D., Greil,F., Lu,X., White,K.P., Bussemaker,H.J. et al. http://nar.oxfordjournals.org/ (2006) Hotspots of colocalization in the 31. Karin,M., Liu,Z.g. and Zandi,E. (1997) AP-1 function and genome of Drosophila melanogaster. Proc. Natl Acad. Sci. USA, regulation. Curr. Opin. Cell Biol., 9, 240–246. 103, 12027–12032. 32. Kawana,M., Lee,M.E., Quertermous,E.E. and Quertermous,T. 10. Gerstein,M.B., Kundaje,A., Hariharan,M., Landt,S.G., Yan,K.- (1995) Cooperative interaction of GATA-2 and AP1 regulates K., Cheng,C., Mu,X.J., Khurana,E., Rozowsky,J., Alexander,R. transcription of the endothelin-1 gene. Mol. Cell. Biol., 15, et al. (2012) Architecture of the human regulatory network 4225–4231. 33. Wang,W., Xue,Y., Zhou,S., Kuo,A., Cairns,B.R. and derived from ENCODE data. Nature, 489, 91–100. Crabtree,G.R. (1996) Diversity and specialization of mammalian 11. Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., SWI/SNF complexes. Genes Dev., 10, 2117–2130. Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. 34. Ito,T., Yamauchi,M., Nishina,M., Yamamichi,N., Mizutani,T.,

et al. (2003) TRANSFAC(R): transcriptional regulation, from at MIT Libraries on December 14, 2013 Ui,M., Murakami,M. and Iba,H. (2001) Identification of patterns to profiles. Nucleic Acids Res., 31, 374–378. SWI.SNF complex subunit BAF60a as a determinant of the 12. Sandelin,A., Alkema,W., Engstrm,P., Wasserman,W.W. and transactivation potential of Fos/Jun dimers. J. Biol. Chem., 276, Lenhard,B. (2004) JASPAR: an open-access database for 2852–2857. eukaryotic transcription factor binding profiles. Nucleic Acids 35. Nateri,A.S., Spencer-Dene,B. and Behrens,A. (2005) Interaction of Res., 32, D91–D94. phosphorylated c-Jun with TCF4 regulates intestinal cancer 13. Berger,M.F., Philippakis,A.A., Qureshi,A.M., He,F.S., Estep,P.W. development. Nature, 437, 281–285. and Bulyk,M.L. (2006) Compact, universal DNA microarrays to 36. Mostoslavsky,R., Chua,K.F., Lombard,D.B., Pang,W.W., comprehensively determine transcription-factor binding site Fischer,M.R., Gellon,L., Liu,P., Mostoslavsky,G., Franco,S., specificities. Nat. Biotechnol., 24, 1429–1435. Murphy,M.M. et al. (2006) Genomic instability and aging-like 14. Badis,G., Berger,M.F., Philippakis,A.A., Talukder,S., phenotype in the absence of mammalian SIRT6. Cell, 124, Gehrke,A.R., Jaeger,S.A., Chan,E.T., Metzler,G., Vedenko,A., 315–329. Chen,X. et al. (2009) Diversity and complexity in DNA 37. Huang,Y., Myers,S.J. and Dingledine,R. (1999) Transcriptional recognition by transcription factors. Science, 324, 1720–1723. repression by REST: recruitment of Sin3A and histone 15. Berger,M.F., Badis,G., Gehrke,A.R., Talukder,S., deacetylase to neuronal genes. Nat. Neurosci., 2, 867–872. Philippakis,A.A., Pea-Castillo,L., Alleyne,T.M., Mnaimneh,S., 38. Nascimento,E.M., Cox,C.L., Macarthur,S., Hussain,S., Botvinnik,O.B., Chan,E.T. et al. (2008) Variation in Trotter,M., Blanco,S., Suraj,M., Nichols,J., Kbler,B., Benitah,S.A. homeodomain DNA binding revealed by high-resolution analysis et al. (2011) The opposing transcriptional functions of Sin3a and of sequence preferences. Cell, 133, 1266–1276. c-Myc are required to maintain tissue homeostasis. Nat. Cell 16. Jolma,A., Yan,J., Whitington,T., Toivonen,J., Nitta,K.R., Biol., 13, 1395–1405. Rastas,P., Morgunova,E., Enge,M., Taipale,M., Wei,G. et al. 39. Zervos,A.S., Gyuris,J. and Brent,R. (1993) Mxi1, a protein that (2013) DNA-binding specificities of human transcription factors. specifically interacts with Max to bind Myc-Max recognition sites. Cell, 152, 327–339. Cell, 72, 223–232. 17. Hughes,J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) 40. Li-Weber,M., Davydov,I., Krafft,H. and Krammer,P. (1994) The Computational identification of Cis-regulatory elements associated role of NF-Y and IRF-2 in the regulation of human IL-4 gene with groups of functionally related genes in Saccharomyces expression. J. Immunol., 153, 4122–4133. cerevisiae. J. Mol. Biol., 296, 1205–1214. 41. Scott,E., Simon,M., Anastasi,J. and Singh,H. (1994) Requirement 18. Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for of transcription factor PU.1 in the development of multiple finding protein-DNA binding sites with applications to chromatin- hematopoietic lineages. Science, 265, 1573–1577. immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 42. Villard,J., Peretti,M., Masternak,K., Barras,E., Caretti,G., 835–839. Mantovani,R. and Reith,W. (2000) A functionally essential 19. Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by domain of RFX5 mediates activation of major histocompatibility expectation maximization to discover motifs in biopolymers. complex class II promoters by promoting cooperative binding Proc. Int. Conf. Int. Syst. Mol. Biol., 2, 28–36. between RFX and NF-Y. Mol. Cell. Biol., 20, 3364–3376. 20. Pavesi,G., Mauri,G. and Pesole,G. (2001) An algorithm for 43. Yu,L., Wu,Q., Yang,C.P. and Horwitz,S.B. (1995) Coordination finding signals of unknown length in DNA sequences. of transcription factors, NF-Y and C/EBP beta, in the regulation Bioinformatics, 17, S207–S214. of the mdr1b promoter. Cell Growth Differ., 6, 1505–1512. Nucleic Acids Research, 2013 11

44. Roder,K., Wolf,S., Larkin,K. and Schweizer,M. (1999) Interaction 63. Rubio,E.D., Reiss,D.J., Welcsh,P.L., Disteche,C.M., between the two ubiquitously expressed transcription factors Filippova,G.N., Baliga,N.S., Aebersold,R., Ranish,J.A. and NF-Y and Sp1. Gene, 234, 61–69. Krumm,A. (2008) CTCF physically links cohesin to chromatin. 45. Caretti,G., Salsi,V., Vecchi,C., Imbriano,C. and Mantovani,R. Proc. Natl Acad. Sci. USA, 105, 8309–8314. (2003) Dynamic recruitment of NF-Y and histone 64. Jelinic,P., Stehle,J. and Shaw,P. (2006) The testis-specific acetyltransferases on cell-cycle promoters. J. Biol. Chem., 278, factor CTCFL cooperates with the protein methyltransferase 30435–30440. PRMT7 in H19 imprinting control region methylation. 46. Ivanov,V.N., Bhoumik,A., Krasilnikov,M., Raz,R., PLoS Biol., 4, e355. Owen-Schaub,L.B., Levy,D., Horvath,C.M. and Ronai,Z. (2001) 65. Bischof,L.J., Kagawa,N., Moskow,J.J., Takahashi,Y., Cooperation between STAT3 and c-jun suppresses fas Iwamatsu,A., Buchberg,A.M. and Waterman,M.R. (1998) transcription. Mol. Cell, 7, 517–528. Members of the Meis1 and Pbx homeodomain protein families 47. Choi,S., Cho,Y., Kim,H. and Park,J. (2007) ROS mediate the cooperatively bind a cAMP-responsive sequence (CRS1) from hypoxic repression of the hepcidin gene by inhibiting C/EBPalpha BovineCYP17. J. Biol. Chem., 273, 7941–7948. and STAT-3. Biochem. Biophys. Res. Commun., 356, 312–317. 66. Kappel,A., Schlaeger,T.M., Flamme,I., Orkin,S.H., Risau,W. and 48. Sementchenko,V.I. and Watson,D.K. (2000) Ets target genes: Breier,G. (2000) Role of SCL/Tal-1, GATA, and ets transcription past, present and future. Oncogene, 19, 6533–6548. factor binding sites for the regulation of flk-1 expression during 49. Rothbcher,U., Bertrand,V., Lamy,C. and Lemaire,P. (2007) A murine vascular development. Blood, 96, 3078–3085. combinatorial code of maternal GATA, Ets and beta-catenin- 67. Mouthon,M.A., Bernard,O., Mitjavila,M.T., Romeo,P.H., TCF transcription factors specifies and patterns the early ascidian Vainchenker,W. and Mathieu-Mahul,D. (1993) Expression of tal-1 ectoderm. Development, 134, 4023–4032. and GATA-binding proteins during human hematopoiesis. Blood, Downloaded from 50. Taylor,J.M., Dupont-Versteegden,E.E., Davies,J.D., Hassell,J.A., 81, 647–655. Houl,J.D., Gurley,C.M. and Peterson,C.A. (1997) A role for the 68. Chan,H.M. and La Thangue,N.B. (2001) p300/CBP proteins: ETS domain transcription factor PEA3 in myogenic HATs for transcriptional bridges and scaffolds. J. Cell Sci., 114, differentiation. Mol. Cell. Biol., 17, 5550–5558. 2363–2373. 51. O’Geen,H., Lin,Y., Xu,X., Echipare,L., Komashko,V.M., He,D., 69. Visel,A., Blow,M.J., Li,Z., Zhang,T., Akiyama,J.A., Holt,A., Frietze,S., Tanabe,O., Shi,L., Sartor,M.A. et al. (2010) Genome- Plajzer-Frick,I., Shoukry,M., Wright,C., Chen,F. et al. (2009) wide binding of the orphan nuclear TR4 suggests its ChIP-seq accurately predicts tissue-specific activity of enhancers. http://nar.oxfordjournals.org/ general role in fundamental biological processes. BMC Genomics, Nature, 457, 854–858. 11, 689. 70. Costa,R.H., Kalinichenko,V.V., Holterman,A.L. and Wang,X. 52. Adams,B., Drfler,P., Aguzzi,A., Kozmik,Z., Urbnek,P., (2003) Transcription factors in liver development, differentiation, Maurer-Fogy,I. and Busslinger,M. (1992) Pax-5 encodes the and regeneration. Hepatology, 38, 1331–1347. transcription factor BSAP and is expressed in B lymphocytes, the 71. Zaret,K.S. and Carroll,J.S. (2011) Pioneer transcription factors: developing CNS, and adult testis. Genes Dev., 6, 1589–1607. establishing competence for . Genes Dev., 25, 53. Fitzsimmons,D., Hodsdon,W., Wheat,W., Maira,S.M., Wasylyk,B. 2227–2241. and Hagman,J. (1996) Pax-5 (BSAP) recruits Ets proto-oncogene 72. Johnson,C.A. and Turner,B.M. (1999) Histone deacetylases: family proteins to form functional ternary complexes on a B-cell- complex transducers of nuclear signals. Semin. Cell Dev. Biol., 10, specific promoter. Genes Dev., 10, 2198–2211.

179–188. at MIT Libraries on December 14, 2013 54. Dudek,H., Tantravahi,R.V., Rao,V.N., Reddy,E.S. and 73. Furusawa,T. and Cherukuri,S. (2010) Developmental function of Reddy,E.P. (1992) Myb and Ets proteins cooperate in HMGN proteins. Biochim. Biophys. Acta, 1799, 69–73. transcriptional activation of the mim-1 promoter. Proc. Natl 74. Peng,J., Zhu,Y., Milton,J.T. and Price,D.H. (1998) Identification Acad. Sci. USA, 89, 1291–1295. of multiple cyclin subunits of human P-TEFb. Genes Dev., 12, 55. Mazars,R., Gonzalez-de-Peredo,A., Cayrol,C., Lavigne,A., 755–762. Vogel,J.L., Ortega,N., Lacroix,C., Gautier,V., Huet,G., Ray,A. 75. Partington,G.A. and Patient,R.K. (1999) Phosphorylation of et al. (2010) The THAP-zinc finger protein THAP1 associates GATA-1 increases its DNA-binding affinity and is correlated with with coactivator HCF-1 and O-GlcNAc transferase: a link induction of human K562 erythroleukaemia cells. Nucleic Acids between DYT6 and DYT3 dystonias. J. Biol. Chem., 285, 13364–13371. Res., 27, 1168–1175. 56. Yu,H., Mashtalir,N., Daou,S., Hammond-Martel,I., Ross,J., 76. Ernst,J., Kheradpour,P., Mikkelsen,T.S., Shoresh,N., Ward,L.D., Sui,G., Hart,G.W., Rauscher,F.J.R., Drobetsky,E., Milot,E. et al. Epstein,C.B., Zhang,X., Wang,L., Issner,R., Coyne,M. et al. (2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternary (2011) Mapping and analysis of chromatin state dynamics in nine complex with YY1 and HCF-1 and is a critical regulator of gene human cell types. Nature, 473, 43–49. expression. Mol. Cell. Biol., 30, 5071–5085. 77. Xu,D., Zhao,L., Del Valle,L., Miklossy,J. and Zhang,L. (2008) 57. Looijenga,L.H., Stoop,H., deLeeuw,H.P., deGouveia Brazao,C.A., Interferon regulatory factor 4 is involved in Epstein-Barr virus- Gillis,A.J., vanRoozendaal,K.E., vanZoelen,E.J., Weber,R.F., mediated transformation of human B lymphocytes. J. Virol., 82, Wolffenbuttel,K.P., vanDekken,H. et al. (2003) POU5F1 (OCT3/ 6251–6258. 4) identifies cells with pluripotent potential in human germ cell 78. Paun,A. and Pitha,P.M. (2007) The IRF family, revisited. tumors. Cancer Res., 63, 2244–2250. Biochimie, 89, 744–753. 58. Loh,Y., Wu,Q., Chew,J., Vega,V.B., Zhang,W., Chen,X., 79. Corcoran,L.M., Karvelas,M., Nossal,G.J., Ye,Z.S., Jacks,T. and Bourque,G., George,J., Leong,B., Liu,J. et al. (2006) The Oct4 Baltimore,D. (1993) Oct-2, although not required for early B-cell and Nanog transcription network regulates pluripotency in mouse development, is critical for later B-cell maturation and for embryonic stem cells. Nat. Genet., 38, 431–440. postnatal survival. Genes Dev., 7, 570–582. 59. Yi,F. and Merrill,B.J. (2007) Stem cells and TCF proteins: a role 80. Baeuerle,P.A. and Henkel,T. (1994) Function and activation of for beta-catenin-independent functions. Stem Cell Rev., 3, 39–48. NF-kappa B in the immune system. Annu. Rev. Immunol., 12, 60. Phillips,J.E. and Corces,V.G. (2009) CTCF: master weaver of the 141–179. genome. Cell, 137, 1194–1211. 81. Lee,C.S., Friedman,J.R., Fulmer,J.T. and Kaestner,K.H. (2005) 61. McKay,M.J., Troelstra,C., vander,P., Kanaar,R., Smit,B., The initiation of liver development is dependent on Foxa Hagemeijer,A., Bootsma,D. and Hoeijmakers,J.H. (1996) transcription factors. Nature, 435, 944–947. Sequence conservation of therad21 Schizosaccharomyces 82. Seto,E., Shi,Y. and Shenk,T. (1991) YY1 is an initiator sequence- pombeDNA double-strand break repair gene in human and binding protein that directs and activates transcription in vitro. mouse. Genomics, 36, 305–315. Nature, 354, 241–245. 62. Wendt,K.S., Yoshida,K., Itoh,T., Bando,M., Koch,B., 83. Nagarajan,P., Onami,T.M., Rajagopalan,S., Kania,S., Donnell,R. Schirghuber,E., Tsutsumi,S., Nagae,G., Ishihara,K., Mishiro,T. and Venkatachalam,S. (2009) Role of chromodomain helicase et al. (2008) Cohesin mediates transcriptional insulation by CCCT DNA-binding protein 2 in DNA damage response signaling and C-binding factor. Nature, 451, 796–801. tumorigenesis. Oncogene, 28, 1053–1062. 12 Nucleic Acids Research, 2013

84. Deng,C. (2003) Roles of BRCA1 in DNA damage repair: a link 98. Schmidt,D., Wilson,M.D., Ballester,B., Schwalie,P.C., between development and cancer. Hum. Mol. Genet., 12, Brown,G.D., Marshall,A., Kutter,C., Watt,S., 113R–123R. Martinez-Jimenez,C.P. et al. (2010) Five-vertebrate ChIP-seq 85. Xie,X., Lu,J., Kulbokas,E.J., Golub,T.R., Mootha,V., Lindblad- Reveals the evolutionary dynamics of transcription factor Toh,K., Lander,E.S. and Kellis,M. (2005) Systematic discovery of binding. Science, 328, 1036–1040. regulatory motifs in human promoters and 3[prime] UTRs by 99. Boyer,L.A., Lee,T.I., Cole,M.F., Johnstone,S.E., Levine,S.S., comparison of several mammals. Nature, 434, 338–345. Zucker,J.P., Guenther,M.G., Kumar,R.M., Murray,H.L., 86. Farnham,P.J. (2009) Insights from genomic profiling of Jenner,R.G. et al. (2005) Core transcriptional regulatory circuitry transcription factors. Nat. Rev. Genet., 10, 605–616. in human embryonic stem cells. Cell, 122, 947–956. 87. Landt,S.G., Marinov,G.K., Kundaje,A., Kheradpour,P., Pauli,F., 100. Lee,T.I., Jenner,R.G., Boyer,L.A., Guenther,M.G., Levine,S.S., Batzoglou,S., Bernstein,B.E., Bickel,P., Brown,J.B., Cayting,P. Kumar,R.M., Chevalier,B., Johnstone,S.E., Cole,M.F., et al. (2012) ChIP-seq guidelines and practices of the ENCODE Isono,K.I. et al. (2006) Control of developmental and modENCODE consortia. Genome Res., 22, 1813–1831. regulators by polycomb in human embryonic stem cells. Cell, 88. Spivakov,M., Akhtar,J., Kheradpour,P., Beal,K., Girardot,C., 125, 301–313. Koscielny,G., Herrero,J., Kellis,M., Furlong,E.E. and Birney,E. 101. MacArthur,S., Li,X., Li,J., Brown,J., Chu,H.C., Zeng,L., (2012) Analysis of variation at transcription factor binding sites Grondona,B., Hechmer,A., Simirenko,L., Keranen,S. et al. in Drosophila and humans. Genome Biol., 13, R49. (2009) Developmental roles of 21 Drosophila transcription 89. Ward,L.D. and Kellis,M. (2011) HaploReg: a resource for factors are determined by quantitative differences in binding to exploring chromatin states, conservation, and regulatory motif an overlapping set of thousands of genomic regions. Genome alterations within sets of genetically linked variants. Nucleic Acids Biol., 10, R80. Res., 40, D930–D934. 102. Pietrokovski,S. (1996) Searching databases of conserved sequence Downloaded from 90. Wang,J., Zhuang,J., Iyer,S., Lin,X., Whitfield,T.W., Greven,M.C., regions by aligning protein multiple-alignments. Nucleic Acids Pierce,B.G., Dong,X., Kundaje,A., Cheng,Y. et al. (2012) Res., 24, 3836–3845. Sequence features and chromatin structure around the genomic 103. Gray,K.A., Daugherty,L.C., Gordon,S.M., Seal,R.L., regions bound by 119 human transcription factors. Genome Res., Wright,M.W. and Bruford,E.A. (2013) Genenames.org: 22, 1798–1812. the HGNC resources in 2013. Nucleic Acids Res., 41, 91. Neph,S., Vierstra,J., Stergachis,A.B., Reynolds,A.P., Haugen,E., D545–D552. Vernot,B., Thurman,R.E., John,S., Sandstrom,R., Johnson,A.K. 104. Kharchenko,P.V., Tolstorukov,M.Y. and Park,P.J. (2008) http://nar.oxfordjournals.org/ et al. (2012) An expansive human regulatory lexicon encoded in Design and analysis of ChIP-seq experiments for DNA-binding transcription factor footprints. Nature, 489, 83–90. proteins. Nat. Biotech., 26, 1351–1359. 92. Berman,B.P., Nibu,Y., Pfeiffer,B.D., Tomancak,P., Celniker,S.E., 105. Kent,W.J., Sugnet,C.W., Furey,T.S., Roskin,K.M., Pringle,T.H., Levine,M., Rubin,G.M. and Eisen,M.B. (2002) Exploiting Zahler,A.M. and Haussler,D. (2002) The human genome transcription factor binding site clustering to identify browser at UCSC. Genome Res., 12, 996–1006. cis-regulatory modules involved in pattern formation in the 106. Harrow,J., Frankish,A., Gonzalez,J.M., Tapanari,E., Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757–762. Diekhans,M., Kokocinski,F., Aken,B.L., Barrell,D., Zadissa,A., 93. Schroeder,M.D., Pearce,M., Fak,J., Fan,H., Unnerstall,U., Searle,S. et al. (2012) GENCODE: The reference human genome Emberly,E., Rajewsky,N., Siggia,E.D. and Gaul,U. (2004) annotation for The ENCODE Project. Genome Res., 22,

Transcriptional control in the segmentation gene network of 1760–1774. at MIT Libraries on December 14, 2013 Drosophila. PLoS Biol., 2, e271. 107. Touzet,H. and Varre,J. (2007) Efficient and accurate P-value 94. Kellis,M., Patterson,N., Endrizzi,M., Birren,B. and Lander,E.S. computation for position weight matrices. Algorithms Mol. Biol., (2003) Sequencing and comparison of yeast species to identify 2, 15. genes and regulatory elements. Nature, 423, 241–254. 108. Wilson,E.B. (1927) Probable Inference, the Law of Succession, 95. Moses,A., Chiang,D., Pollard,D., Iyer,V. and Eisen,M. (2004) and Statistical Inference. J. Am. Stat. Assoc., 22, 209–212. MONKEY: identifying conserved transcription-factor binding 109. Mahony,S., Auron,P.E. and Benos,P.V. (2007) DNA familial sites in multiple alignments using a binding site-specific binding profiles made easy: comparison of various motif evolutionary model. Genome Biol., 5, R98. alignment and clustering strategies. PLoS Comput. Biol., 3, e61. 96. Kheradpour,P., Stark,A., Roy,S. and Kellis,M. (2007) Reliable 110. Sandelin,A. and Wasserman,W.W. (2004) Constrained prediction of regulator targets using 12 Drosophila genomes. binding site diversity within families of transcription factors Genome Res., 17, 1919–1931. enhances pattern discovery bioinformatics. J. Mol. Biol., 338, 97. Lindblad-Toh,K., Garber,M., Zuk,O., Lin,M.F., Parker,B.J., 207–215. Washietl,S., Kheradpour,P., Ernst,J., Jordan,G., Mauceli,E. et al. (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature, 478, 476–482. Supplementary Methods

Comparing motifs

Motif similarity is defined as the maximal Pearson’s correlation of the PWM made into a 4xN vector across all offsets and both orientations, padding unmatched positions with N’s (102). Motifs are consid- ered a match if they have a similarity of at least 0.75; weak matches or variants are determined through manual inspection.

Collecting known motifs

Known motifs were collected from large scale data sets or databases. We collected human, mouse, and rat motifs from Transfac (version 11.3) (11), vertebrate motifs from Jaspar (version 2008) (12), and large scale systematic motifs generated using Protein Binding Microarrays (13, 14, 15) and high- throughput SELEX (16). HGNC gene names (103) were used to describe the names of motifs whenever possible.

Processing and naming of experimental data sets

Human protein binding peaks (excluding Pol2/Pol3) were taken from the January, 2011 freeze of EN- CODE (8). The processing for these data sets is described in (104, 10) and the original data sets are available on the website accompanying this paper. To avoid potentially confounding issues, we excluded from our analysis: (1) the Y and mitochondrial ; (2) the hg19 rmsk and simple repeat tracks from UCSC (created on April 27, 2009) (105); and (3) protein coding regions, exons for non- coding genes, and 3’ untranslated regions taken from GENCODE v4 (106). Like with the motif names, HGNC gene names (103) are used to describe the data sets.

Performing de novo motif discovery

Peaks were randomly partitioned into two data sets for the purpose of separating discovery and en- richment and limit over-fitting. The top 250 peaks for the first partition were used in motif discovery because high intensity peaks often have better enrichment for motifs (101). Five tools were run inde- pendently run on each data set: AlignACE (v4.0 with default parameters) (17), MDscan (v2004 with default parameters) (18), MEME (v3.5.7 with a maximum of 10 iterations and -maxw 10, 15, and 25 for the three motifs) (19), Weeder (v1.4.2 with option large) (20), and Trawler (v1.2 with 250 random intergenic blocks for background) (21). Any motifs beyond the top three for any method on one data set were discarded.

Computing enrichments of motifs

Motifs were matched against the genome using a PWM threshold corresponding to a p-value of 4−8 as determined by TFM-Pvalue (107), with the intent to imitate the frequency a fully specified 8-mer would match the genome. Enrichments are always computed by taking a foreground region (i.e. bound regions for a data set) and comparing it to a background region (e.g. Intergenic non-repetitive regions) where regions in the foreground but not in the background are discarded. For a given motif, the fraction of matches that fall within the foreground is computed and divided by the corresponding fraction for shuffled control motifs (control motif generation details in (97)). To prevent spuriously high enrichments due to small counts, we compute a binomial confidence interval (with z=1.5) (108) around both motif and control fractions and take the extreme which leads to the enrichment closest to 1 (the software available on the website implements this enrichment metric). For each factor group we order the motifs from all discovery programs by the enrichment in the second random partition for the discovery data set. For this step only, we restrict our analysis to 10% of the background regions to reduce the amount of computation. We then select discovered motifs for each factor by this rank order discarding any motif that matches a previous one with similarity greater than 0.75. We supplement these discovered motifs with the known motifs from the literature (described above) and rematch the motifs to the complete background regions and produce comparable enrichments for all data sets.

Supplementary Results

Robustness of motif similarity

While it has been shown that short motifs will sometimes have similar correlations just by chance (109, 110), this happens infrequently for the 0.75 similarity threshold with our database of motifs. When two motifs in the database match, scrambling one of the motifs 100 times leads to more than 5 matches only 6.4% of the time. Moreover, a given motif in the database is 34.7 times more likely to match another motif in a database than a scramble. While this statistic is hard to interpret due to, on one hand, the significant redundancy in the database and, on the other, the partial exclusion of similar discovered motifs, it demonstrates that spurious motif matches should be relatively uncommon.

Effect of peak intensity on motif enrichment

While it has been shown that motif enrichment varies by peak intensity (101), we find that our results are largely robust to restricting to only the 1,000 strongest peaks per data set (89% of data sets have at least 1,000 peaks). The Pearson correlation of motif enrichment (for all motifs) for one data set when using all peaks or only the top 1,000 has a median of 0.89 (middle 50%: 0.85 to 0.95). Moreover, the enrichment of the first discovered motif for a factor group has strong correlation between the complete and restrict data sets: for the 55 factor groups with at least two data sets and a discovered motif, the median correlation of the first discovered motif’s enrichment is 0.97 (middle 50%: 0.81 to 1.00). 2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C GG C G CG G G G C G GG G C G GG G A C G G GG A T G C G T C A G A C T TC C G A CG C CC C A C G C C G G A C G C T GA G C A A CG T A A T G C C A C

C C G C C G G C G C G G G CG C TT A AA T ATT T A AA AT T AA T A TA A 0.0 0.0 0.0 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 AP1 disc10 ATF3 disc3 ATF3 disc4 BDP1 disc3

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C C G G C G C C G G G G G G C GG C G C G G CC CG C A C G C G C CAC C G C C CCC G GG GT GG AC G T G G A C T CGG TTC CC C T CG TT G C C G G G 0.0 T A 0.0 AAATA 0.0 AAA A T A 0.0 ATTTATTAAA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 BHLHE40 disc2 CHD2 disc2 CHD2 disc3 E2F disc7

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits C bits bits C bits G G C G GG C C C G G G AAG C G G G CG G C G G C G C G C C C TC T C G CG C G CG G G C C G C C T T G C C G C A A G A T G A T A A A T T A

C T A T G G C C C G G GG C C G G G G C C G C CG C C 0.0 AA TTA T A 0.0 A A T 0.0 A A AT 0.0 A T A T A 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 E2F disc8 EBF1 disc2 ELF1 disc2 ELF1 disc3

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits C G C G G bits C C C bits G bits A G C CC C GGGGG G G C G CG C G A C GGC GG C CC GC G CG G C A G G T A A C A C G T A T CT A A C G C A T AAA A A C G G T G C CC C C C C T G G C 0.0 A T 0.0 TTAT T T 0.0 A TTAAA A A 0.0 A T AA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 ESRRA disc4 EGR1 disc6 ETS disc9 GATA disc5

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C C G C AA GC C GC G GC C G CG C C C CC AG AG AA A G G T A AA A G A C G CT TG A G G C T GGA C GG G T CGC A C C C G A A C T G T A C A G G G C G T T T A C G T G G C C C A G G C C G C C G G C G G G C G G G C C C C G T A T T A T 0.0 T AT 0.0 TT T TT 0.0 T T T 0.0 A T 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 NR3C1 disc6 HDAC2 disc6 HEY1 disc2 MYC disc9

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits G G G G G G C C G C C GC C G C G G G C G G C C G G G AG C C G G A G C G C G A C C G G CC C G C G C C CA GA A A C TAG C A GG C G A A A T C T G T T A C A T T A G T C A A C T A T C G AA A G C C C C C G G G C G C G GG C C G C 0.0 TA A 0.0 AATT TTTTA 0.0 TA TA AT T T 0.0 T AT 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 MYC disc10 NRF1 disc3 REST disc4 REST disc5

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C C GC GC C G G C G A G G G G C G G G C G CG CGCG G G G G G C C C C CGCCC G A A A A G C G C T A C A C G G T T T C C G CG A T A GA A G A T T G T C G G AA G A C G G G C GC C C C C G G C C C G 0.0 AATTTAAT A 0.0 T A T TT 0.0 T A ATA T 0.0 T T T C 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 EP300 disc8 EP300 disc9 PAX5 disc5 SPI1 disc3

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits C bits bits GC G G G G C C C G C C G G G G CC C C GC G C G G G G C G C C G G GC T T A C G A C A C C C T G A A C C A C G A G A C G C T G A T G T G A A GT AT C A C T T T T C A C T G A T A A T C G C G C G CC G G CG C C C G C C G G C C GG G CG G 0.0 A AT TA 0.0 T T T A 0.0 T T A TA 0.0 TA T 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 POU2F2 disc2 RAD21 disc5 RAD21 disc6 RAD21 disc7

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits C C C G G bits GC CC C C bits G bits C C G T C C CC T C C C C G G G GC G T TT CT C C G C C T T A T A C A T C A A T T A T A C G C C T A C T

G C G G G GG G GG C C C C C G G C G G G G C G C G G 0.0 T T T 0.0 AAT A 0.0 AT TT T 0.0 T AA T 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 12 13 RAD21 disc8 SRF disc2 STAT disc6 STAT disc7

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits CC C G GC GC G G G G G C G GCC C G G G T C G C G GC G T G C G G C G G G C C G C G C GC T C A T C T C A G A T G T C C A

A A T C CG G T T A CG C A G G C C G C G G C C C G C C CT G C A G 0.0 A 0.0 AA 0.0 TT A A ATT A 0.0 TT T A 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 SIN3A disc5 SIN3A disc6 SIN3A disc7 TATA disc10

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C GCT G G GC GC C G C G T G C G GC A CC G C C C G G C A GCA A G C GC G A GC G C C C A A G G G T A A G T A C T A C TCG T G T C G CCGG G C G G C C G C G G C G 0.0 T TA 0.0 A AT 0.0 AAATA TA T 0.0 A T TTT 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 TCF12 disc5 TCF12 disc6 YY1 disc3 YY1 disc4

2.0 2.0

1.0 1.0 bits bits C C GC GC GC GC G C C C GG C G C GGG C GC C CC G G C G T G GT G GA G C CA 0.0 TAT T AT 0.0 T TA A T T 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 YY1 disc5 ZNF143 disc4

Table S1: Low complexity or weakly enriched motifs. 2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits G G bits G bits A G bits C G GG G G G A A G A G G C C C C G T C T C A GT G G A G A TA C C G G G C A T G A G G C T A A T A G A G T A TAG C GT A C T CC C G G C CG C C C C CC G GGC C C G C AT A TTT ATA T TT TT T T T 0.0 A A 0.0 A TA A 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 CTCF disc2 CTCF disc3 CTCF disc4 CTCF disc5

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits A bits G T CC CCAGGGG C CC A C GAA TTAATC AC G GG AG ACCCCA G CGC GG CA GC GG CG G G C G G G T TTTT TTAT T T A 0.0 A T 0.0 0.0 A T 0.0 TTT 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 CTCF disc6 CTCF disc7 CTCF disc10 E2F disc5

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits A bits T bits bits G C A A G C G CACC T GG T A CG G G C GA AA CG G G A A T GC TA G G T GGT AC T G C C T T T C TCA C C C C CA G G C C CC G CCC GCC A T A AT T AATTA A TTTTA T 0.0 A 0.0 0.0 AA 0.0 AATATT 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 EGR1 disc7 ESRRA disc3 ETS disc4 FOXA disc2

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits A bits C bits C AA bits T T T A G C CC G C GC C G G C G G C T A G GG G 0.0 A T AT 0.0 AA A 0.0 A T A 0.0 T AA TT 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 GATA disc4 HNF4 disc2 HNF4 disc3 IRF disc6

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits A C G G G C T ACC C T G G C G G G G A C G G G G C GA GG C C G C G A G C C A C C T T A A CG G CG T AAA A C C C C 0.0 TA TTT 0.0 AAA AA 0.0 TAATTTAA 0.0 AATT TTTTA 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 MYC disc6 MYC disc7 MYC disc8 NRF1 disc3

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits G C C G bits bits bits A T T C CGCC C G G A G G G G AC A G T G T T C G C C A T A T G A T G A A A G T A A

GC C C C C GG CGC CC C G G CCC GGC G G G G 0.0 0.0 A A TA 0.0 AT A T 0.0 A AT TT 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 NFE2 disc4 REST disc2 REST disc3 REST disc6

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits GTC bits C A C bits bits G G A C C TCTAGTGG GTG G C GG T G CGCGCAA CGA C C G T T A T GTATAGT CAAA T C GC G GC CC C C G C G CC C G TA AT T AT AA T A T T 0.0 A A 0.0 TT 0.0 0.0 A 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 REST disc7 REST disc10 RAD21 disc2 RAD21 disc4

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits C bits G bits bits A T T C C G CCC T A C G G T T C CA C A AG GA G A CCA G CC T C G C A T T C G T A A G A G C A GAG T

A G C G CG C C G G G C G G C C G CC GG C TT T A T T T T TTT A T T 0.0 T 0.0 0.0 AA TT 0.0 T T T T 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 RAD21 disc9 RAD21 disc10 STAT disc5 SIN3A disc3

2.0 2.0 2.0

1.0 1.0 1.0 bits A bits A bits GA C C CCCC G G GG GC GC G C AA CGC C CGG C G 0.0 T T T 0.0 T T T 0.0 TT TT 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 SIN3A disc4 TCF12 disc3 ZBTB7A disc2

Table S2: Motifs with weak similarity to some other discovered motif. These occur frequently for factors with long known motifs that are amenable to breaking apart. We do not believe that these represent distinct motifs and are likely artifacts of the discovery process. Known motif Discovered motif Known motif Discovered motif

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits T C bits bits G bits A C G A T T C G TG G G G A G A A T C G C G T T A C G T TA A C C G G A T C T TC GC C G A G C C CG C C GC GG GC G G C G G G G AA A A T AA A A T TT A A AA A 0.0 A 0.0 0.0 0.0 A A A T 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 TCF12 known1 TCF12 disc1∗ PRDM1 known2 PRDM1 disc1∗ 3.0 (9.5) 5.6 (11.2) 18.4 (23.3) 19.2 (25.1)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits CG A AC ACT GT C C CT AT GCA A A C T A T A G G C C A G A G T T A C A C T C G A T C G A T C G A C C T G GG TA G TG A C T A C G TG A A T G C G G A A G C G C A C G C A C A AA C A C T A G C A T C G A C C T A T T G T C TT A T C G C G C C G C G G G C CC CC G G G C G G G G G G C G C G G G C T T T A ATT AT T A T T TT TA TT ATA A A A A A 0.0 0.0 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ZBTB7A known4 ZBTB7A disc1 CTCF known2 CTCF disc1∗ 1.3 (10.0) 2.2 (43.5) 78.0 (219.6) 77.3 (118.5)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C C G T CC GGCA CTA C G C G G A C GT C A A C G T G T T A A C A G A T C TT C T G C A C G CA T T C C A C C GC T TG C A G G G A T TC G G A T C A T A G AG G C CG G G C G G C C C C G C C G C AT AT A AA T AA AT T A T T T A A T TT T T T 0.0 0.0 T 0.0 0.0 AA 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 MXI1 known1 MXI1 disc2∗ RFX5 known6 RFX5 disc1∗ 3.2 (9.6) 4.4 (29.5) 28.6 (33.8) 28.3 (60.7)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits G T G G C GCA A C CT A C T C T A G C G A A GG C A G GA A G G C C G A T T G TG T G C C T CC A C C A T G CG G C G G C G C G G C G C T A TTT AT AA A TT TT AAA T A 0.0 T A 0.0 0.0 T 0.0 T A T A 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 NFE2 known1 NFE2 disc1∗ AP1 known3 AP1 disc3∗ 56.4 (78.8) 75.9 (125.7) 45.7 (52.5) 45.1 (82.0)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits A G A A CA G T T A G A TT A C G G GAG C T G C C T CCG A T G G C A G AC T A T C A C G A T A G T T A A A T G G C G G A C C A G T T C A A C GC C GT GG T T A G T G A A C G G C GC G G C G C CC G G C C G C G CC C C C C AAT A TT T AAT TT A A T T T AT T AT T 0.0 A 0.0 A 0.0 T T 0.0 T 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 NFY known6 NFY disc1 FOXA known6 FOXA disc1∗ 72.6 (110.6) 89.9 (84.7) 19.9 (14.4) 19.5 (11.2)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits

C G A G C G C G T C C C G G G A C T T C G A A GC C G C T AG G A G TG A G A C T T T C TT C CG GT T G G T G A A AA T C T C A T T A A G C T TA G T C A T CG CA T T G A T G C C G C C C G G CGC G C CC C C GC GC C G A A AT A A A ATT T A 0.0 TA 0.0 T AT A 0.0 A AT 0.0 T ATAT A 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 YY1 known5 YY1 disc1 MYC known22 MYC disc1∗ 39.5 (38.6) 47.5 (94.3) 104.8 (334.2) 97.8 (139.5)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C T G C C G GC T A GCGC T T T C A A T A T TT G A G C C GC A C T C G A A A T C C G T A GA C AT C A G C C C C G G CC G C CC G G G C G G A T T A A A TA T T T TTAT 0.0 A T 0.0 T 0.0 T A 0.0 TT AA 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 ZEB1 known1 ZEB1 disc1 CEBPB known8 CEBPB disc1∗ 4.9 (3.4) 5.8 (11.1) 84.9 (100.6) 79.0 (91.4)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits CT GT CAA CA GT GCA C AGA G G T AA A A A T T G A T A G T T T T G G A G T C G T C G A A T CG C T G T T C T G T CG G T G A A A A G A C T C T T A A A A A G C G T C A G C G A C A T T C A GG GG C G G G G G C C G C GG G G C CC C C G C C CC G C C C C A G A T T A T T A A A TATT T AATT T T T T 0.0 0.0 0.0 AA 0.0 A AA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 POU5F1 known3 POU5F1 disc1∗ SPI1 known4 SPI1 disc1∗ 28.3 (22.7) 30.0 (24.8) 26.5 (30.5) 24.6 (37.0)

Table S3: Complete version of Figure 4, with all factors for which a discovered motif matching a known motif was found. When the discovered motif is not disc1, a better ranking motif was found that did not match a literature motif. ∗ indicates that other discovered motifs were found, even after manual exclusion of weakly redundant and low complexity motifs. Known motif Discovered motif Known motif Discovered motif

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits A A bits A bits T GC A T T A GA C AA G A G TC A G A C T A CT A T G T A A A C A G T A C A CC A A A A A AG A TG C A AT T C A A C T G T A C T T C G CG G G C CC C CG C G C C G G G C CC GG C T A ATT A TA TT TTATTTT T A TTT T 0.0 A 0.0 AT 0.0 TA 0.0 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SRF known7 SRF disc1 MEF2 known12 MEF2 disc1∗ 126.6 (88.9) 112.1 (93.8) 12.0 (6.8) 7.9 (6.7)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C G T G C G C G A G CT A T A A A T C T T A A A G G A T T CA C G C C TA A C T A G T T TT A A T C T C TG G AT C G G G AAA AC G T G C T T T A A A C G T A A T TT C G A G GA A T C C A T C A T A C AT G A C G G G C C G CCG G C C C CG C C G C CC C C C G C C G G C GG G G C A T TT T A TT A A TT T 0.0 0.0 0.0 0.0 A T A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 2 3 4 5 6 7 8 9 10 EBF1 known4 EBF1 disc1 MAF known7 MAF disc1 8.7 (28.0) 7.3 (22.3) 57.8 (68.2) 37.1 (50.1)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits T bits A T T bits bits AAT G G A T C A T CT A AT T C T C G A G TT T AA C A T C C C CAC G T A T G C CA CC C G G A G G T G G C G C A C A T G C TC T G G T G T G GA G T G A C T C A T AT G A GC C C C G G C CG G C GG GG CC G C G G C G A A AAT A T A A TA A ATT T A T T A AT T T A 0.0 0.0 T 0.0 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 NFKB known3 NFKB disc1 NR3C1 known17 NR3C1 disc1∗ 39.8 (78.5) 30.3 (45.7) 33.5 (49.4) 20.8 (25.0)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C T T G T A C C G AC G A A C C TG C A T C A C G A T C T C C A TT A G C G T TA C T G C T T A A C C G G C G G T T G G C TG A GC G G C A C A G ATTG A T A C A CC G G C T A G A G C C C G G C G GG G G C C G C G G G GG G C A G C AA ATAAT T A A T T AA A A T TAATTT A 0.0 A 0.0 AA 0.0 T 0.0 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ELF1 known2 ELF1 disc1 ESRRA known6 ESRRA disc1∗ 22.9 (68.8) 17.0 (105.9) 35.6 (60.4) 21.0 (39.0)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits G bits bits C A G G bits G CA G TT C C G A AGC C C C G C G T TT A C A A AT G A G C A A C C A T T C A C G GC G G CA C G TA T G T T G C C A T C A A G T GA T G GT A AC G C G G G G G G G G G C C G C CC C C G G C CGG GC G TA TT AT TT A T AA T TTT T T TAA AAA T ATTTA 0.0 0.0 0.0 A 0.0 A A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 RXRA known10 RXRA disc1∗ REST known2 REST disc1 9.6 (12.4) 7.0 (8.2) 66.9 (188.9) 39.4 (115.6)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits G A G A A C C C G T T C G C CCC T G G C T G T A A C C TT T T T C A C T G T T T C A T AA T G CAA A T G GAG T C T T AA C A AT A G G T G A C C TG G C C GG G G G C C C G G C C C C C CCC CG C C C G T AAACTACTAA T A T T A A T T TA A A 0.0 0.0 0.0 0.0 TT A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 EGR1 known8 EGR1 disc1∗ TCF7L2 known5 TCF7L2 disc2∗ 11.9 (299.6) 8.6 (238.1) 10.9 (7.8) 6.0 (9.0)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits A A T G CA C C A G C A C A GC AT G A G G AG C A G A T T G C T T GC C G A G T A A TC C C T A A C C A G G C G G C A G GC GC A G AC T T A G G G G G A G G GT G C A A T C C A C C A G G T T GC T C C G C C C C G C C G G C C C G T G GGG G G C G G T TA TA A T TT T T TTAAT A A T TTTT AT A TA TT T T 0.0 A 0.0 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 HNF4 known18 HNF4 disc1∗ NRF1 known2 NRF1 disc1∗ 28.0 (34.2) 20.0 (30.2) 84.7 (1196.5) 41.3 (969.7)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits C G bits G G bits bits G C T A G G G GC CCA CG C G A A G G A AC T T AG A A T A T G C A C C T G T G G G G GC CT G G C C GA T G G T G G T GG A G C A C T A G A A T AG T C C AC A C G G A A G A A G A G G AG A G C C A C T T T A A A A C A C T C T A T T C C T C C T CA A T T AC C C C G C C G C G C C G C C C G C G C G G C G C G C T T T T T T TT T TT T T T AAAA TG T TT A A TAT AT T A 0.0 0.0 0.0 0.0 AT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 PAX5 known5 PAX5 disc1∗ GATA known14 GATA disc1∗ 20.5 (78.2) 14.3 (47.3) 23.8 (31.3) 6.9 (23.4)

Table S3 (continued) Known motif Discovered motif

2.0 2.0

1.0 1.0 bits bits T GCA T A C A T A A TG A C TG A AA T G A T ATC T TTA C T C ATT GG C T GG A CG G G G G T A GG C GC C G CG C C C GC G A AT T A A A T 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 POU2F2 known15 POU2F2 disc1 21.1 (5.3) 6.1 (11.0)

2.0 2.0

1.0 1.0 bits bits T G CG T C C T G AA TC C T A T C G CCT CC T GC G A G C C G C A C A C G CG T C G T T T T G G A

G C CG CGG G C C G GG 0.0 A A AT T 0.0 AA AT T T 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 13 E2F known9 E2F disc3∗ 12.1 (127.5) 3.1 (71.2)

2.0 2.0

1.0 1.0 bits bits T T A ACG AC C T GG CGAA CT A GT A C G C C C T G CA T C G G G G C CG G C CCG T AT T A TA A T 0.0 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BHLHE40 known4 BHLHE40 disc1 70.2 (168.4) 17.3 (167.0)

2.0 2.0

1.0 1.0 bits bits G G T GG CG G G A C GG G G G G G G G C T CC GAA AT ATC G A G T T A T G A TG A T A A A A A A A C C C GG C G C C A T TA T TAA TTATAT 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SP1 known8 SP1 disc3∗ 5.8 (37.7) 1.0 (8.8)

2.0 2.0

1.0 1.0 bits bits A AGT AC C A C G G C G G C G G CG G C TA A A C GT C C C C G CG GG TT T A A TT A A AA TA 0.0 0.0 A 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 ETS known18 ETS disc2∗ 53.9 (306.1) 8.0 (330.6)

2.0 2.0

1.0 1.0 bits bits C CC CT AA G GT G GG G C T CC C C G T C T AA G G AT A T C A G G G G G G C C G A T A T T T T A A A T A A T AA AA A 0.0 A A A A 0.0 A A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 IRF known20 IRF disc3∗ 61.0 (74.1) 8.8 (15.4)

2.0 2.0

1.0 1.0 bits TGAC TT G C T bits AC TT GA A C C C C C A C C CA C T A TCC C G C AC C T G A G T C G G CA GC C G G A G G G GGC GC 0.0 AT AA TTTT A 0.0 TTA ATAA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 NR2C2 known1 NR2C2 disc2∗ 53.4 (51.8) 4.0 (33.4)

2.0 2.0

1.0 1.0 bits C bits C CA G G TC C C C C C C T C C T T C G T A T T GC GGAAA C A T G T A A T T G A A A G CT T C G G G G G G G C GC GG C 0.0 A T TTA AA AAT T 0.0 TTAT TAA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 2 3 4 5 6 7 8 9 10 STAT known2 STAT disc1∗ 116.3 (131.0) 2.4 (56.1)

Table S3 (continued) Factor Matching discovered motifs

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits G C C C G CC C C C C T A T A GT T A G T G G C C G G C C T A G C G A C AC G G G T C GG A G T T A

G C G G C C 0.0 TTTATAAAA 0.0 T AATAAA T 0.0 TT AATAT A 0.0 TT ATT AA 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 AP1 known5 GATA disc2 (32) SMARC disc1 (34) STAT disc2 (46) (TRE)

2.0

1.0 bits

A G T G C C

A C

T C G C GA 0.0 T ATT A 1 2 3 4 5 6 7 8 9 10 TCF7L2 disc1 (35)

2.0 2.0

1.0 1.0 bits bits GT C TT CA AC A GGT T T G AG A AA G G T C A C A G T T A C

C G C G G G C GCC T T TA TA T 0.0 0.0 TT AAT 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 CEBPB known7 STAT disc4∗ (47)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits bits bits bits C GG G G C AT A C GG A C C GC GC CT C C C C G C C GC G A T C A A G TA C C A A GA A T A T G TA T G A TA G T GTA G A C A A G G G C G TG C G A G G T G T G TG A G G G G G G T T A C A T G C C AA A T G A T T A A A C G T C G G A A C A A A A G C T A C A C TT C T A T T T T T A A A T G A G C C C C C C C C C GC G C G G G G C G G C C G G C G G C G GG A A TAAT T T T A A T TA TT T TT T T T AT AA A 0.0 0.0 A 0.0 0.0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 CTCF known1 CTCFL disc1∗ (64) RAD21 disc1∗ (62) SMC3 disc1 (63)

2.0 2.0

1.0 1.0 bits bits C G AG GAC C T C AA G CA C G A G C A T AA A T G A G C T C C G GG C C GG 0.0 TT TA TT 0.0 A TA T AT 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 ETS known7 NR2C2 disc1 (51)

2.0 2.0 2.0

1.0 1.0 1.0 bits C A bits A bits A A GT C G G T C G G C TG GG T A G A G GG A GA A A G G C A A G C T C CCAAA T G AT G G C CG C CG C G GG C C C C G C C A T T A TT AAT AT T 0.0 A A 0.0 0.0 A AA A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 ETS known4 GATA disc3 (49) MEF2 disc2 (50)

2.0 2.0

1.0 1.0 bits bits G A A T A AG T G C G C A A G CGTG C GA G G G GA G C A A T A G A C G CT C G C G C G C T T GC A T C A G T T A T TT C T G C A A C T GCG C C G C GCC AAAA TG AT T T AA T T TGA TT 0.0 0.0 A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 GATA known14 TAL1 disc1 (66, 67)

2.0 2.0

1.0 1.0 bits bits G T T A G A G A G C A G T T C T T C C G A T T A AAG G T A A A C CGA A C C G C G C G G G C C G G G CG 0.0 T ATT T 0.0 AA T 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 MEIS1 1 PBX3 disc2 (65)

Table S4: Selected shared motifs with literature support. Shown are the motifs that match a known motif for the indicated factor along with relevant citations. Details in the text. ∗ indicates motif is reverse complemented for comparison purposes. Factor Matching discovered motifs

2.0 2.0

1.0 1.0 bits bits C G AA T C C GGA A C T G A CC G C G CGGC A C 0.0 A ATAT T 0.0 AA TT 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 MYB 4 ETS disc8 (54)

2.0 2.0

1.0 1.0 bits bits C A A T GG G T A C CG C GG CC A AT C G T GC G C AAC A A G C T T AT T C G G GC GC G C G C C G T T CA GT A TA 0.0 TA 0.0 ATAT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 MYC known3 SIN3A disc2 (38)

2.0 2.0 2.0 2.0

1.0 1.0 1.0 1.0 bits G bits C bits bits A CA A CAG CA AG A A C CT A G G C T C C A C TG G G G GA G T A C T A G TGGCA C C A A T A G C C C G C GCC G G C G G G C CC CC 0.0 AAT AT TT 0.0 A A 0.0 TT AATA 0.0 T AATT 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 NFY known5 CEBPB disc2 (43) E2F disc4∗ (45) IRF disc1 (40)

2.0 2.0

1.0 1.0 bits bits A AG CA AG G G C AA G G G GA A G AC G G T C G G G G A A TC G C C A C G C C G C CT T C GT TA T C C A C GT A AAA G T C GAC A A C C C GG G C C G CC C C C T TT T A T A AT TT TTT T A A A 0.0 A 0.0 A 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 RFX5 disc2 (42) SP1 disc1∗ (44)

2.0 2.0

1.0 1.0 bits bits CT T CA CT G T G G A T C T T C A G TT G A T A G T C T T GG T ACC TTC T CG T T C C C A A T A CAT TT A C A G T A A A A C C C T G G A C C A G A C G G G C T G G C G G GG G C C G C C G C G G C G A T T A A A A AA 0.0 0.0 TT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 POU5F1 known3 NANOG disc2∗ (57, 58)

2.0 2.0

1.0 1.0 bits A bits T C CC C AC GC C C C GG GC C C G AC A T G GG A A A T T T C GT GA A C A G T A G ATA T G GA T G G GG C G G C C G G C G G C C G G T AT A A T TA ATAAT AA A 0.0 0.0 A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 REST known3 SIN3A disc1∗ (37)

2.0 2.0

1.0 1.0 bits bits A G AGG GT AA T AA G AC A G CA CA C G GG G G GGG GGC TA AT 0.0 A 0.0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 9 10 SPI1 known2 IRF disc5 (41)

2.0 2.0

1.0 1.0 bits bits C GC T C G T T AG C TCTT G A G A A A TC T C G AG G CGC G C C C G A A 0.0 TA 0.0 AA AT TT 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 YY1 known5 THAP1 disc2 (55, 56)

Table S4 (continued) Discovered motif Motif from (85)

2.0 2.0

1.0 1.0 bits bits C G TCT A C A G A GG G G CA G C CC CG G CGCG AT A TT T T 0.0 0.0 T A A 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 BRCA1 disc1 (Novel1) M8

2.0 2.0

1.0 1.0 bits T bits C CA GGGAGT A AG CCA AC GT C C G G GGG G C G ATTT T T 0.0 T TA TT 0.0 TA T 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 ETS disc1 (Novel2) M4

2.0

1.0 bits A AA C GTTG 0.0 TT TT 1 2 3 4 5 6 7 8 9 10 ETS disc1 (Novel2) M108/FAC1

2.0 2.0

1.0 1.0 bits bits AA GGAGTTGT TGACCCG AACCGG C T T C GG G G C 0.0 TTAAAA 0.0 A TT T 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 ETS disc5 (Novel2) M129

2.0 2.0

1.0 1.0 bits bits CA ATT AGA TG G GGG CCCG GG G T T 0.0 TA 0.0 A AT A 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 SIX5 disc2 (Novel2) M75/STAT1

2.0 2.0

1.0 1.0 bits GG bits A TC GGGCGG GGGGCG G 0.0 AA AT 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 SP2 disc3 (Novel3) M6/SP1

2.0 2.0

1.0 1.0 bits bits C A CCCC GG CGGAA G C G GCG C 0.0 TT TT 0.0 TTT 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 ZBTB7A disc2 (Novel3) M164

2.0 2.0

1.0 1.0 bits G bits C A G GCG C GG CGG A T T 0.0 AA A 0.0 AA AA 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 12 TATA disc5 (Novel6) M21

Table S5: Novel motifs matching those found using systematic conservation measures in four mammals by Xie et al. (85). One representative motif is shown for each putative novel motif when multiple match the same motif. Each motif from (85) is named using the provided identifier and annotation. References

8. The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.

10. Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K.-K., Cheng, C., Mu, X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O’Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers, R. M., Weissman, S. M., and Snyder, M. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature, 489, 91–100.

11. Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V., Kloos, D., Land, S., Lewicki-Potapov, B., Michael, H., Munch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., and Wingender, E. (2003) TRANSFAC(R): transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31, 374–378.

12. Sandelin, A., Alkema, W., Engstrm, P., Wasserman, W. W., and Lenhard, B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, 32, D91–94.

13. Berger, M. F., Philippakis, A. A., Qureshi, A. M., He, F. S., Estep, P. W., and Bulyk, M. L. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology, 24, 1429–1435.

14. Badis, G., Berger, M. F., Philippakis, A. A., Talukder, S., Gehrke, A. R., Jaeger, S. A., Chan, E. T., Metzler, G., Vedenko, A., Chen, X., Kuznetsov, H., Wang, C., Coburn, D., Newburger, D. E., Morris, Q., Hughes, T. R., and Bulyk, M. L. (2009) Diversity and Complexity in DNA Recognition by Transcription Factors. Science, 324, 1720–1723.

15. Berger, M. F., Badis, G., Gehrke, A. R., Talukder, S., Philippakis, A. A., Pea-Castillo, L., Alleyne, T. M., Mnaimneh, S., Botvinnik, O. B., Chan, E. T., Khalid, F., Zhang, W., Newburger, D., Jaeger, S. A., Morris, Q. D., Bulyk, M. L., and Hughes, T. R. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133, 1266–1276.

16. Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, K. R., Rastas, P., Morgunova, E., Enge, M., Taipale, M., Wei, G., Palin, K., Vaquerizas, J. M., Vincentelli, R., Luscombe, N. M., Hughes, T. R., Lemaire, P., Ukkonen, E., Kivioja, T., and Taipale, J. (2013) DNA-Binding Specificities of Human Transcription Factors. Cell, 152, 327–339.

17. Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000) Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology, 296, 1205–1214.

18. Liu, X. S., Brutlag, D. L., and Liu, J. S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnol- ogy, 20, 835–9.

19. Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 2, 28–36.

20. Pavesi, G., Mauri, G., and Pesole, G. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, S207–S214.

21. Ettwiller, L., Paten, B., Ramialison, M., Birney, E., and Wittbrodt, J. (2007) Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nature Methods, 4, 563– 565.

32. Kawana, M., Lee, M. E., Quertermous, E. E., and Quertermous, T. (1995) Cooperative interac- tion of GATA-2 and AP1 regulates transcription of the endothelin-1 gene. Molecular and Cellular Biology, 15, 4225–4231.

34. Ito, T., Yamauchi, M., Nishina, M., Yamamichi, N., Mizutani, T., Ui, M., Murakami, M., and Iba, H. (2001) Identification of SWI.SNF complex subunit BAF60a as a determinant of the transacti- vation potential of Fos/Jun dimers. The Journal of Biological Chemistry, 276, 2852–2857.

35. Nateri, A. S., Spencer-Dene, B., and Behrens, A. (2005) Interaction of phosphorylated c-Jun with TCF4 regulates intestinal cancer development. Nature, 437, 281–285. 37. Huang, Y., Myers, S. J., and Dingledine, R. (1999) Transcriptional repression by REST: recruitment of Sin3A and histone deacetylase to neuronal genes. Nature Neuroscience, 2, 867–872.

38. Nascimento, E. M., Cox, C. L., Macarthur, S., Hussain, S., Trotter, M., Blanco, S., Suraj, M., Nichols, J., Kbler, B., Benitah, S. A., Hendrich, B., Odom, D. T., and Frye, M. (2011) The opposing transcriptional functions of Sin3a and c-Myc are required to maintain tissue homeostasis. Nature Cell Biology, 13, 1395–1405.

40. Li-Weber, M., Davydov, I., Krafft, H., and Krammer, P. (1994) The role of NF-Y and IRF-2 in the regulation of human IL-4 gene expression. The Journal of Immunology, 153, 4122 –4133.

41. Scott, E., Simon, M., Anastasi, J., and Singh, H. (1994) Requirement of transcription factor PU.1 in the development of multiple hematopoietic lineages. Science, 265, 1573 –1577.

42. Villard, J., Peretti, M., Masternak, K., Barras, E., Caretti, G., Mantovani, R., and Reith, W. (2000) A Functionally Essential Domain of RFX5 Mediates Activation of Major Histocompatibility Complex Class II Promoters by Promoting Cooperative Binding between RFX and NF-Y. Molecular and Cellular Biology, 20, 3364 –3376.

43. Yu, L., Wu, Q., Yang, C. P., and Horwitz, S. B. (1995) Coordination of transcription factors, NF-Y and C/EBP beta, in the regulation of the mdr1b promoter. Cell Growth & Differentiation: The Molecular Biology Journal of the American Association for Cancer Research, 6, 1505–1512.

44. Roder, K., Wolf, S., Larkin, K., and Schweizer, M. (1999) Interaction between the two ubiquitously expressed transcription factors NF-Y and Sp1. Gene, 234, 61–69.

45. Caretti, G., Salsi, V., Vecchi, C., Imbriano, C., and Mantovani, R. (2003) Dynamic Recruitment of NF-Y and Histone Acetyltransferases on Cell-cycle Promoters. Journal of Biological Chemistry, 278, 30435 –30440.

46. Ivanov, V. N., Bhoumik, A., Krasilnikov, M., Raz, R., Owen-Schaub, L. B., Levy, D., Horvath, C. M., and Ronai, Z. (2001) Cooperation between STAT3 and c-Jun Suppresses Fas Transcription. Molecular Cell, 7, 517–528.

47. Choi, S., Cho, Y., Kim, H., and Park, J. (2007) ROS mediate the hypoxic repression of the hepcidin gene by inhibiting C/EBPalpha and STAT-3. Biochemical and Biophysical Research Communica- tions, 356, 312–317. 49. Rothbcher, U., Bertrand, V., Lamy, C., and Lemaire, P. (2007) A combinatorial code of maternal GATA, Ets and beta-catenin-TCF transcription factors specifies and patterns the early ascidian ectoderm. Development, 134, 4023–4032.

50. Taylor, J. M., Dupont-Versteegden, E. E., Davies, J. D., Hassell, J. A., Houl, J. D., Gurley, C. M., and Peterson, C. A. (1997) A role for the ETS domain transcription factor PEA3 in myogenic differentiation. Molecular and Cellular Biology, 17, 5550–5558.

51. O’Geen, H., Lin, Y., Xu, X., Echipare, L., Komashko, V. M., He, D., Frietze, S., Tanabe, O., Shi, L., Sartor, M. A., Engel, J. D., and Farnham, P. J. (2010) Genome-wide binding of the orphan TR4 suggests its general role in fundamental biological processes. BMC Genomics, 11, 689.

54. Dudek, H., Tantravahi, R. V., Rao, V. N., Reddy, E. S., and Reddy, E. P. (1992) Myb and Ets proteins cooperate in transcriptional activation of the mim-1 promoter. Proceedings of the National Academy of Sciences, 89, 1291 –1295.

55. Mazars, R., Gonzalez-de-Peredo, A., Cayrol, C., Lavigne, A., Vogel, J. L., Ortega, N., Lacroix, C., Gautier, V., Huet, G., Ray, A., Monsarrat, B., Kristie, T. M., and Girard, J. (2010) The THAP- zinc finger protein THAP1 associates with coactivator HCF-1 and O-GlcNAc transferase: a link between DYT6 and DYT3 dystonias. The Journal of Biological Chemistry, 285, 13364–13371.

56. Yu, H., Mashtalir, N., Daou, S., Hammond-Martel, I., Ross, J., Sui, G., Hart, G. W., Rauscher, Frank J, r., Drobetsky, E., Milot, E., Shi, Y., and Affar, E. B. (2010) The ubiquitin carboxyl hydrolase BAP1 forms a ternary complex with YY1 and HCF-1 and is a critical regulator of gene expression. Molecular and Cellular Biology, 30, 5071–5085.

57. Looijenga, L. H. J., Stoop, H., de Leeuw, H. P. J. C., de Gouveia Brazao, C. A., Gillis, A. J. M., van Roozendaal, K. E. P., van Zoelen, E. J. J., Weber, R. F. A., Wolffenbuttel, K. P., van Dekken, H., Honecker, F., Bokemeyer, C., Perlman, E. J., Schneider, D. T., Kononen, J., Sauter, G., and Oosterhuis, J. W. (2003) POU5F1 (OCT3/4) identifies cells with pluripotent potential in human germ cell tumors. Cancer Research, 63, 2244–2250.

58. Loh, Y., Wu, Q., Chew, J., Vega, V. B., Zhang, W., Chen, X., Bourque, G., George, J., Leong, B., Liu, J., Wong, K., Sung, K. W., Lee, C. W. H., Zhao, X., Chiu, K., Lipovich, L., Kuznetsov, V. A., Robson, P., Stanton, L. W., Wei, C., Ruan, Y., Lim, B., and Ng, H. (2006) The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nature Genetics, 38, 431–440.

62. Wendt, K. S., Yoshida, K., Itoh, T., Bando, M., Koch, B., Schirghuber, E., Tsutsumi, S., Nagae, G., Ishihara, K., Mishiro, T., Yahata, K., Imamoto, F., Aburatani, H., Nakao, M., Imamoto, N., Maeshima, K., Shirahige, K., and Peters, J. (2008) Cohesin mediates transcriptional insulation by CCCTC-binding factor. Nature, 451, 796–801.

63. Rubio, E. D., Reiss, D. J., Welcsh, P. L., Disteche, C. M., Filippova, G. N., Baliga, N. S., Aebersold, R., Ranish, J. A., and Krumm, A. (2008) CTCF physically links cohesin to chromatin. Proceedings of the National Academy of Sciences of the United States of America, 105, 8309–8314.

64. Jelinic, P., Stehle, J., and Shaw, P. (2006) The Testis-Specific Factor CTCFL Cooperates with the Protein Methyltransferase PRMT7 in H19 Imprinting Control Region Methylation. PLoS Biol, 4, e355.

65. Bischof, L. J., Kagawa, N., Moskow, J. J., Takahashi, Y., Iwamatsu, A., Buchberg, A. M., and Waterman, M. R. (1998) Members of the Meis1 and Pbx Homeodomain Protein Families Co- operatively Bind a cAMP-responsive Sequence (CRS1) from BovineCYP17. Journal of Biological Chemistry, 273, 7941 –7948.

66. Kappel, A., Schlaeger, T. M., Flamme, I., Orkin, S. H., Risau, W., and Breier, G. (2000) Role of SCL/Tal-1, GATA, and ets transcription factor binding sites for the regulation of flk-1 expression during murine vascular development. Blood, 96, 3078–3085.

67. Mouthon, M. A., Bernard, O., Mitjavila, M. T., Romeo, P. H., Vainchenker, W., and Mathieu- Mahul, D. (1993) Expression of tal-1 and GATA-binding proteins during human hematopoiesis. Blood, 81, 647–655.

85. Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander, E. S., and Kellis, M. (2005) Systematic discovery of regulatory motifs in human promoters and 3[prime] UTRs by comparison of several mammals. Nature, 434, 338–345.

97. Lindblad-Toh, K., Garber, M., Zuk, O., Lin, M. F., Parker, B. J., Washietl, S., Kheradpour, P., Ernst, J., Jordan, G., Mauceli, E., Ward, L. D., Lowe, C. B., Holloway, A. K., Clamp, M., Gnerre, S., Alfoldi, J., Beal, K., Chang, J., Clawson, H., Cuff, J., Di Palma, F., Fitzgerald, S., Flicek, P., Guttman, M., Hubisz, M. J., Jaffe, D. B., Jungreis, I., Kent, W. J., Kostka, D., Lara, M., Martins, A. L., Massingham, T., Moltke, I., Raney, B. J., Rasmussen, M. D., Robinson, J., Stark, A., Vilella, A. J., Wen, J., Xie, X., Zody, M. C., Worley, K. C., Kovar, C. L., Muzny, D. M., Gibbs, R. A., Warren, W. C., Mardis, E. R., Weinstock, G. M., Wilson, R. K., Birney, E., Margulies, E. H., Herrero, J., Green, E. D., Haussler, D., Siepel, A., Goldman, N., Pollard, K. S., Pedersen, J. S., Lander, E. S., and Kellis, M. (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature, 478, 476–482.

101. MacArthur, S., Li, X., Li, J., Brown, J., Chu, H. C., Zeng, L., Grondona, B., Hechmer, A., Simirenko, L., Keranen, S., Knowles, D., Stapleton, M., Bickel, P., Biggin, M., and Eisen, M. (2009) Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biology, 10, R80.

102. Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments.. Nucleic Acids Research, 24, 3836–3845.

103. Gray, K. A., Daugherty, L. C., Gordon, S. M., Seal, R. L., Wright, M. W., and Bruford, E. A. (2013) Genenames.org: the HGNC resources in 2013. Nucleic acids research, 41, D545–552.

104. Kharchenko, P. V., Tolstorukov, M. Y., and Park, P. J. (2008) Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotech, 26, 1351–1359.

105. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D. (2002) The Human Genome Browser at UCSC. Genome Research, 12, 996 –1006.

106. Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., Barnes, I., Bignell, A., Boychenko, V., Hunt, T., Kay, M., Mukherjee, G., Rajan, J., Despacio-Reyes, G., Saunders, G., Steward, C., Harte, R., Lin, M., Howald, C., Tanzer, A., Derrien, T., Chrast, J., Walters, N., Balasubramanian, S., Pei, B., Tress, M., Rodriguez, J. M., Ezkurdia, I., van Baren, J., Brent, M., Haussler, D., Kellis, M., Valencia, A., Reymond, A., Gerstein, M., Guig, R., and Hubbard, T. J. (2012) GENCODE: The reference human genome annotation for The ENCODE Project. Genome research, 22, 1760–1774.

107. Touzet, H. and Varre, J. (2007) Efficient and accurate P-value computation for Position Weight Matrices. Algorithms for Molecular Biology, 2, 15. 108. Wilson, E. B. (1927) Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22, 209–212.

109. Mahony, S., Auron, P. E., and Benos, P. V. (2007) DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies. PLoS Comput Biol, 3, e61.

110. Sandelin, A. and Wasserman, W. W. (2004) Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. Journal of molecular biology, 338, 207–215.