Hum Genet (2014) 133:701–711 DOI 10.1007/s00439-013-1413-1

Review Paper

The grammar of transcriptional regulation

Shira Weingarten-Gabbay · Eran Segal

Received: 11 June 2013 / Accepted: 24 December 2013 / Published online: 5 January 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract eukaryotes employ combinatorial strategies to regulatory motifs (Levine and Tjian 2003) and exploit motif generate a variety of expression patterns from a relatively geometry as another dimension of combinatorial power small set of regulatory DNA elements. As in any other lan- for regulating transcription. Understanding the fundamen- guage, deciphering the mapping between DNA and expres- tal principles governing transcriptional regulation could sion requires an understanding of the set of rules that govern allow us to predict expression from DNA sequence, with basic principles in transcriptional regulation, the functional far-reaching implications. Most notably, in many human elements involved, and the ways in which they combine to diseases, genetic changes occur in non-coding regions such orchestrate a transcriptional output. Here, we review the cur- as promoters and enhancers. However, without under- rent understanding of various grammatical rules, including standing the grammar of transcriptional regulation, we the effect on expression of the number of transcription factor cannot tell which sequence changes affect expression and binding sites, their location, orientation, affinity and activity; how. For example, even for a single binding site, we do not co-association with different factors; and intrinsic nucleo- know the quantitative effects on expression of its location, some organization. We review different methods that are orientation, and affinity; whether these effects are general, used to study the grammar of transcription regulation, high- factor-specific, and/or promoter-dependent; and how they light gaps in current understanding, and discuss how recent depend on the intrinsic nucleosome organization. Similarly, technological advances may be utilized to bridge them. we do not know which properties determine whether mul- tiple sites contribute additively or cooperatively to expres- sion, what types of cooperativity functions can be achieved, Introduction how they depend on the affinity of the sites and identity of the factors, and whether their mechanistic basis involves Proper control of mRNA levels is critical in nearly all bio- protein–protein interactions and/or nucleosome eviction. logical processes. Since much of this control is encoded Unraveling this transcriptional grammar will allow us to within non-coding regulatory regions, deciphering the map- understand, predict and design expression patterns from ping between DNA sequence and expression levels is key regulatory sequences (Fig. 1). for understanding transcriptional control. Eukaryotes are Addressing this challenge requires knowledge of both known to employ combinatorial strategies to generate a the functional elements and the ways in which such ele- variety of expression patterns from a relatively small set of ments combine to orchestrate a transcriptional output. Testing the effect of designed DNA mutations has been successfully employed for several decades in the research S. Weingarten‑Gabbay · E. Segal (*) of transcriptional control, but on the scale of a handful of Department of Computer Science, Applied Mathematics sequences per study. A major hindrance to progress is the and Department of Molecular Cell Biology, Weizmann Institute of Science, 76100 Rehovot, Israel limited ability to measure the transcriptional effect of a e-mail: [email protected] large number of designed DNA sequences in which specific S. Weingarten‑Gabbay regulatory elements are systematically varied. Recently e-mail: [email protected] developed technologies increase the throughput of these

1 3 702 Hum Genet (2014) 133:701–711

Fig. 1 Illustration of grammatical rules and their effect on gene which each rule affects expression. For example: cooperative bind- expression. Transcriptional grammar: the rules of transcription regu- ing results in a sigmoidal curve since one binding event increases the lation are illustrated as “grammar cards”. Each card represents a dif- probability of another binding to occur, resulting in higher expres- ferent rule. For example, the number of binding sites for the same sion. The x axis changes according to the rule identity. For example, TF, the distance of one TF from the TSS, the orientation of the bind- for the rule of binding site location, the x axis represents distance ing site relative to the TSS, etc. Expression function: the function by from the TSS (in bp) experiments by ~1,000-fold, allowing us to gain consider- sequences to decipher the principles governing transcrip- ably more insight into how information is encoded in the tion regulation. These include comparative computational language of DNA. In this review, we discuss several exam- models (Nguyen and D’Haeseleer 2006; Beer and Tava- ples of grammatical rules in transcription, highlight the zoie 2004; Segal et al. 2002), high-throughput assays to main gaps, and discuss how these may be bridged using map functional elements in the such as TF bind- recent technological advances. ing sites and nucleosomes (Morin et al. 2008; Johnson et al. 2007; Crawford et al. 2006; Gaulton et al. 2010; Dostie and Dekker 2007), and classical genetic techniques Methods to decipher the grammar of transcription including reporter assays for quantitative activity measure- ments (Takahashi et al. 1986; Giniger and Ptashne 1988; A broad range of methods exists for annotating and test- Burz et al. 1998). Accumulation of genome-wide data ing functional regulatory elements in non-coding DNA on gene expression (RNA-seq) (Morin et al. 2008), TF

1 3 Hum Genet (2014) 133:701–711 703 binding landscape (Chip-seq) (Johnson et al. 2007), chro- locations. Such collection of promoters could not be gener- matin state [DNase-seq (Crawford et al. 2006) and FAIRE- ated by random ligation of regulatory elements. Systematic seq (Gaulton et al. 2010)], and physical DNA interactions manipulation of some specific promoters (Takahashi et al. (5C) (Dostie and Dekker 2007) led to the identification of 1986; Giniger and Ptashne 1988; Burz et al. 1998) leads to potential promoter and enhancer regions, the TFs bound profound insights, but since the variants were constructed to these regions, and the chromatin architecture (Dunham one by one, time and cost considerations have limited the et al. 2012). However, although revealing an unprecedented scale of previous studies, such that to date, only a mod- number of regulatory elements in the genome, these stud- est number of elements have been characterized at high ies do not assay the mechanism and functional activity of resolution. these elements. For example, we cannot tell which of the Recent advances in the fields of DNA synthesis and binding sites of a TF affects transcription and in what man- deep sequencing provided a fertile ground for the develop- ner. Genome-wide quantitative measurements of native ment of new high-throughput approaches that address this enhancers were facilitated by recent developed methods technological barrier. These approaches provide the ability such as self-transcribing active regulatory region sequenc- to accurately measure the activity of tens of thousands dif- ing (STARR-Seq) (Arnold et al. 2013). Yet, native enhanc- ferent designed regulatory elements within a single experi- ers differ in many sequence elements making it hard to ment. Although different in the way in which they measure attribute the measured expression differences to any single expression, these methods utilize the power of massive syn- sequence change. Thus, it is difficult to infer systematic thesis of DNA oligonucleotides on a microarray platform, rules of transcriptional grammar solely by quantitatively harvest these oligos as pool and proceed with this com- measuring native sequences. Another approach uses com- plex library as input for a single experiment (Patwardhan putational models for learning the complex combinatorial et al. 2009, 2012; Melnikov et al. 2012; Sharon et al. 2012; code underlying gene expression (Nguyen and D’Haeseleer Kwasnieski et al. 2012) (Fig. 2). The first approach (Pat- 2006; Beer and Tavazoie 2004; Segal et al. 2002). These wardhan et al. 2009, 2012; Melnikov et al. 2012; Kwas- studies utilize mRNA expression data and DNA sequence nieski et al. 2012) is applicable both in-vitro and in-vivo elements in the promoters of the corresponding to and measures expression by the counts of mRNA molecules decipher the effect of motif strength, orientation, and rela- transcribed from each tested sequence using RNA-seq. To tive position on gene expression. However, although com- evaluate transcriptional activity in-vitro (Patwardhan et al. putational studies generate a large number of mechanistic 2009), each oligonucleotide in the library is designed to hypotheses, experimental validation is still required. include a unique barcode sequence downstream to the TSS. One direct and quantitative way to measure the activ- The oligos are transcribed in-vitro, and the resulting tran- ity of regulatory element is to fuse a DNA sequence to a scripts are then sequenced. The relative abundance of each reporter gene and measure its expression with biochemi- barcode provides a digital readout of the transcriptional cal assays such as luciferase assay. Researches have uti- efficiency of its cis-linked designed sequence. This proto- lized this approach successfully to determine the activity of col was recently adapted to measure transcription in-vivo promoters (Juven-Gershon et al. 2006), enhancers (Blanco in mammalian cells (Patwardhan et al. 2012; Melnikov et al. 2005) and insulators (Schwartz et al. 2012; Herold et al. 2012; Kwasnieski et al. 2012). To that end, the oli- et al. 2012). However, the construction of these reporters gonucleotides pool is inserted into plasmids, where each by traditional cloning techniques is slow and labor-inten- plasmid contains one of the designed sequences upstream sive, limiting throughput to at most dozens of regulatory to an arbitrary open reading frame (ORF) followed by a elements per experiment. Several medium-scale (Chiang unique barcode sequence. Plasmids are co-transfected into et al. 2006; Ligr et al. 2006; Kinkhabwala and Guet 2008) cells, where each designed sequence drives the transcrip- and large-scale (Gertz et al. 2009; Cox et al. 2007; Kinney tion of mRNAs containing the barcode in their 3′ UTRs. To et al. 2010; Schlabach et al. 2010) libraries were created in estimate their relative activities, the barcodes in the reporter , yeast and mammalian cells, in which regulatory mRNAs and the plasmid pools are then sequenced and the elements were randomly ligated, mutagenized or synthe- ratios of these counts are calculated. sized in tandem and the expression of the resulting promot- Another approach uses fluorescence to measure expres- ers was measured. These studies provided much insight, but sion in-vivo (Sharon et al. 2012). Although it was devel- their random nature imposes limitations on the repertoire oped in yeast, it can in principle be adapted to mammalian of promoters constructed and thus they cannot systemati- cells. In this approach, each oligonucleotide in the library, cally dissect basic principles of transcriptional grammar. which includes a unique barcode at the 5′ end, is ligated For example, studying the effect of binding site location on into a plasmid upstream of a yellow fluorescent reporter expression requires measurements of promoters that dif- (YFP). The pool of plasmids is then transformed into yeast fer only in the location of the site and sampling many such such that every transformed strain carries one plasmid

1 3 704 Hum Genet (2014) 133:701–711 expressing YFP from one promoter in the library. Using flu- Fig. 2 High-throughput measurements of thousands of systemati-▸ orescence activated cell sorter (FACS), the resulting library cally designed regulatory sequences. Fluorescence-based measure- ments: ssDNA oligos are synthesized on an array and harvested as a of yeast cells is then sorted into ~32 bins according to YFP single pool. The entire library is cloned into a plasmid upstream of a expression. Next, the promoters of each bin are sequenced fluorescent reporter (YFP).A plasmid pool is transformed into yeast such that for every barcode (representing one promoter), such that each cell expresses a single plasmid. Cells are sorted into the total number of reads is counted. Finally, by examin- bins according to their YFP/mCherry ratio using fluorescence acti- vated cell sorter (FACS). Promoters from each bin are then amplified ing the distribution of reads of every promoter across the and sent for deep sequencing. In the final step, the activity of each expression bins, both its mean and standard deviation of sequence in the library is computed by calculating reads distribution expression can be calculated. among expression bins. RNA-seq based measurements: ssDNA oli- Although currently limited by the length of the synthesis gos are synthesized on an array and harvested as a single pool. The entire library is cloned into a plasmid upstream of an open reading DNA oligo (200 bp), these technologies provide ~1,000- frame. Each plasmid carries a unique barcode sequence downstream fold increase in throughput over previous methods, thereby to the ORF. Plasmids can be in-vitro transcribed or transfected into providing new methodologies for accelerating research in mammalian cells for in-vivo studies. Next, plasmids and mRNAs the field of transcriptional regulation. are extracted from cells; barcodes are amplified and sent for deep sequencing. In the final step, the activity of each sequence in the library is computed by calculating the ratio between RNAseq and DNAseq reads number (Ei) Grammatical rules of transcription

TF specificity able to activate transcription when cloned in a plasmid in front of a luciferase gene, indicating that they can interact The first step in understanding the grammar of a language productively with p53 when removed from their native con- is to dissect its most basic building blocks. In transcrip- text and chromatin environment. tional regulation, one of the most important building blocks Despite these advances, the recognition sites of many is binding sites for transcription factors (TFBS). For each TFs are still missing and for the known ones we still do of the ~1,500 TFs in the human genome, we would like to not understand the effect of context, site accessibility, chro- know the length and composition of its sequence prefer- matin, and TF concentration on expression. Some param- ences, which are essential for its binding and eters such as chromatin and TF concentration vary between how sensitive is the TF to sequence alterations in its bind- conditions and are expected to affect binding in a condi- ing site. For any TF that forms homo/hetero-dimers, we tion-specific manner. To date, we cannot tell how many should determine the spacing between its half-sites and any sequence changes in its recognition site can a TF tolerate other constraint that may apply to its sequence. In addition, before it will affect its activity, and whether this number is we should analyze the effect of other elements, which are uniform for most TFs or changes for different TFs families. not directly bound by the TF, on its specificity. Answering such questions requires a systematic analysis In-vitro and in-vivo binding profiles generated by pro- measuring the effect of TF site affinity on expression for a tein-binding microarrays (Berger et al. 2008; Badis et al. large number of TFs. 2009), high-throughput SELEX and ChIP-seq (Jolma et al. 2013) determined the binding specificities of hundred TF activity TFs. However, despite this growing information about TF binding specificity, much less is known about the effect Once a TF is bound to the DNA, it can enhance or repress of TF site affinity on expression.A recent study utilized transcription by promoting or blocking the recruitment of high-throughput measurement technology to compare RNA-polymerase. However, we cannot predict whether the expression of 2,104 enhancers in which the binding the TF functions as an activator or as a repressor from the sites for 7 TFs were varied (Kheradpour et al. 2013). As sequence of its binding site. In addition, the expression lev- expected, this study found that changes in expression cor- els, localization and activity of TFs vary between cell types, relate with the change in motif match score, indicating that different phases of cell cycle, and in response to stress gene expression is correlated with binding affinity when all conditions. Therefore, we cannot tell in advance which TF else is maintained. However, sites with different affinities binding site will affect expression and to what extent. Even for the same TF are rarely placed in the exact same con- the most comprehensive mapping of TF binding sites cannot text, making it hard to study the relationship between bind- provide information about the activity of the TF in each of ing site strength and expression. Indeed, examining binding these sites. To achieve that, direct measurements of expres- sites for the transcription factor p53 in the genome reveals sion are needed in various cell types and conditions. that many sites with predicted high-sequence specificity are Large-scale data from the ENCODE project yielded not bound by p53 (Lidor Nili et al. 2010). These sites were genome-wide binding profiles for ~100 TFs in various

1 3 Hum Genet (2014) 133:701–711 705

Fluorescence-based measurements RNAseq-based measurements

primer Synthetic oligo primer

Clone synthetic Clone synthetic oligos pool upstream oligos pool upstream to YFP reporter to transcribed ORF

minP Barcode minP ORF YFP mCherry Barcode

Transfection into Transformation into mammalian cells as In-Vitro transcription as copy per cell per cell pool

Sort yeast into bins Extract barcoded according to YFP mRNAs and plasmids expression

4 3 5 .. AAAA 2 AAAA 32

Counts 1 AAAA

YFP/mCherry Deep sequencing of Deep sequencing of barcoded mRNAs and promoters from plasmids each bin

DNA-seq Low expression Medium expression High expression RNA-seq

Calculate expression Calculate expression mean and stdv mean

Expression of Expression of promoter #1 promoter #2 # RNA-seq readsi Ei = # DNA-seq readsi # reads # reads Expression bins Expression bins

1 3 706 Hum Genet (2014) 133:701–711 cell types, providing insight on the actual occupancy of be orientation-sensitive to place the RNA-polymerase in binding sites in the human genome(Gerstein et al. 2012). the right direction. To recognize the functional TFBSs in However, the functionality of these binding events was a given promoter sequence, we should know for each TF not determined. A recent study (Whitfield et al. 2012) whether it can activate transcription from one of the DNA combined information gained from these binding pro- strands or from both. files with computational methods to identify TFBS and One way to achieve orientation-insensitive regulation is to measure their activity within a functional assay. The using palindromic sequences as binding sites. Indeed, pal- authors predicted and mutagenized 455 binding sites for indromes are predominant in regulatory elements (FitzGer- six TFs in human promoters and measured expression ald et al. 2004). A palindromic design has two main advan- using luciferase assay in four cell lines. They found that tages. First, can be bound on either strand 30 % of TFBSs were not functional in any cell line and of DNA, thus doubling the local concentration of the TF only 14 % of the sites were verified in all four cell lines. binding site and increasing productive encounters between These results emphasize that the presence of a TFBS the TF and the DNA. Second, palindromes can be bound by itself is not sufficient to determine whether it will be by TFs that work as dimers. However, palindromic binding active in-vivo and highlights the importance of in-vivo sites are flanked by non-palindromic sequences, which in functional measurements of TFBSs. Despite being more some cases contribute to TF binding and are therefore sen- comprehensive than previous studies, this study still sitive to the orientation (Nguyen and D’Haeseleer 2006). examined a total of only six TFs, and due to the native There are also examples for non-palindromic binding sites nature of the tested sequences, the promoter background that can equally activate transcription when placed in the was not kept constant. Thus, differences in activity levels forward or the reverse orientation (Yu et al. 1997). Thus, we cannot be attributed solely to the binding site, as they may cannot predict orientation sensitivity just by the appearance also represent the effect of the context, including binding of palindromic sequence in the TF binding site. A genome- sites for co-activators, GC content, and nucleosome posi- wide computational study investigated the directionality of tioning signals. A more direct assay for comparing activ- promoter motifs by comparing conservation on the forward ity measurements of different TFs involves planting sites and the reverse strands in four mammalian (Xie for different TFs at the exact same location and within the et al. 2005). This study found that conservation of promoter same background sequence. Utilizing a new approach for motifs is largely symmetric on both strands of the DNA in measuring the expression of thousands of designed regu- contrast to 3′-UTR motifs that have a strand preference, latory sequences within one experiment, a recent study suggesting that most TFs are not affected by the orientation compared the activity of 75 different yeast TFs to each of their binding site. Consistent with this finding, a func- other, by separately planting each TF site within the same tional experiment that systematically compared the activ- promoter sequence (Sharon et al. 2012). ity of 75 yeast TF binding sites in two orientations found Despite these advances above, to date we do not know strong effects on expression for only 8 % of the binding the actual fraction of functional TFBSs in a given cell, sites (Sharon et al. 2012). which of these sites act as an activator or a repressor (or Taken together, we do not yet know the orientation sen- both), and how the activity of different TFs compares to sitivity of the ~1,500 TFs in human, providing another hin- one another. One way to answer such questions is to per- drance to models aimed to predict gene expression from form a full survey of all known TF binding site in various sequence. Here too, a systematic study that will place the contexts and conditions through a synthetic high-through- sites of all human TFs in each orientation and within the put approach similar to that recently done in yeast (Sharon same sequence background can provide much insight. et al. 2012). Combinatorial co‑regulation by different TFs The effect of TFBS orientation The architecture of promoters is often composed of binding Unlike regulatory elements playing a role at the mRNA sites for many TFs working together to orchestrate a tran- level, which are restricted to the coding strand of the DNA, scriptional output. TFs can cooperate with each other by regulatory elements in promoters can appear on both direct physical interactions, forming homodimers, heterodi- the coding and non-coding strands. This mechanism has mers, or larger transcriptional complexes (Walhout 2006). some advantages such as doubling the probability to find Many TFs belong to families that bind their target genes TFBS in a specific location and preventing steric interfer- as dimers, such as bZIP, bHLH, or nuclear hormone recep- ence between two neighboring TFs if they bind the DNA tor families (Reece-Hoyes et al. 2005). Interaction among surface from opposite sides. On the other hand, since tran- TFs can be specific to context and condition, resulting in scription is directional, some of these elements should a unique transcriptional program. Since many of these

1 3 Hum Genet (2014) 133:701–711 707 interactions are synergistic, we cannot simply sum up the Few studies set out to investigate the effect of bind- effects that were measured for individual TFs. Dissecting ing site number for one specific TF (Burz et al. 1998; Yu the effect on expression of TF combinations is therefore a et al. 1997). Increasing number of sites for Bicoid led to great challenge. greater-than-additive increase in expression (Giniger and Studies focused on a small number of TFs had shown Ptashne 1988). The same study showed that Bicoid bound that for the same TF, different partners have different to a strong site promotes occupancy of an adjacent weak effects on transcription, resulting in distinct functional site by cooperative DNA binding (Burz et al. 1998). By pathways (Bakiri et al. 2002; Reed et al. 2008). By system- parallel measurements of thousands of synthesized promot- atically mapping all combinatorial protein–protein interac- ers, a recent study in yeast designed synthetic promoters tions for 1,222 TFs in human, 762 interactions were found to systematically test the dependence of expression on the (Ravasi et al. 2010). However, out of these interactions, we number of sites (Sharon et al. 2012). The consensus site for cannot tell which affects transcription and by what mecha- Gcn4 was planted in all possible combinations of 1–7 sites nism. Analysis of Chip-seq data of 119 factors (Gerstein at 7 predefined locations within two different promoter et al. 2012) found a statistical enrichment for specific TF sequences (128 sequences for each context). Examining the combinations, which may suggest that these TFs work average expression of all possible locations for each num- together. Interestingly, the same TF was found to co-asso- ber of sites resulted in a clear relationship between the num- ciate with different partners in different contexts such as ber of sites and the average expression, which accurately gene-proximal and distal regions. fits a logistic function, and in which expression increases Even with the recent progress in mapping TFs interac- with addition of each of the first four sites but then mostly tions at the protein level and studying their co-association saturates. Such saturation in expression was also observed when binding the genome, we do not have a systematic in small-scale studies (Burz et al. 1998; Yu et al. 1997). view of how different combinations of TFs affect gene However, it is unclear whether this saturation is at the level expression. This requires direct expression measurements of binding or activation of transcription. Intriguingly, com- of different constructs containing various combinations of paring the expression of individual promoters with specific TF binding sites. Such measurements, in comparison to combinations of site locations suggests that the relative dis- measurements of each TF separately, could answer which tance between two sites significantly affects expression.A s of the TF pairs works synergistically, whether they act to one example, when located in close vicinity, the TF mol- increase or decrease expression, and to what extent. How- ecules may sterically occlude each other. ever, the space of TF combinations is vast, and system- Future studies are needed to investigate the effect of atically assaying it is challenging even with the recent binding site number for other TFs, to quantitatively char- advances in throughput. acterize the effect of the distance between multiple bind- ing sites, and to understand which biological mechanism The effect of binding site number underlies the sigmoidal behavior.

Multiple binding sites for the same TF, also known as The effect of TFBS distance from the TSS and DNA homotypic clusters of TFBSs, are statistically enriched in helical repeats proximal promoters and distal enhancers. Conservation of such site clusters between vertebrate and invertebrates When addressing the question of how the relative loca- suggests that homotypic clustering is a general organiza- tion of a TF binding site affects transcription, one should tion principle of cis-regulatory regions (Gotea et al. 2010). remember that the DNA is a three-dimensional molecule Using multiple binding sites for the same TF as a mean to and that changing the distance has a geometric aspect. For regulate expression may have several mechanistic advan- example, two regulatory elements that are five nucleotides tages. These include lateral diffusion of a TF binding along apart from each other are also located on opposite sides of a regulatory region (Kim et al. 1987; Khoury et al. 1990; the DNA double helix. Therefore, we should decipher the Coleman and Pugh 1995), high-affinity cooperative binding effects on expression of both the absolute distance of the (Hertel et al. 1997), and functional redundancy (Somma TFBS from the TSS, and the relative angle between the et al. 1991; Papatsenko et al. 2002). To integrate informa- TFBS and the TSS on the DNA double helix. tion about the number of sites when predicting gene expres- A computational genome-wide analysis of TF motifs sion, we need to know the function that relates TFBS num- and gene expression data addressed the first parameter and ber to expression. Does expression increases linearly with computed the effect of absolute distance on gene expres- the number of binding sites? What is the range in which sion (Nguyen and D’Haeseleer 2006). Depending on the adding a binding site still has a substantial effect? Are these TF identity, expression can reach its maximal values when behaviors general or TF-specific? the motif is either within 150 bp from the start codon, at

1 3 708 Hum Genet (2014) 133:701–711 intermediate distance of 150–300 bp, or at long-range dis- the composition of TF sites, their geometry and combinato- tances of 300–450 bp. The second parameter of relative rial interactions is not enough to predict gene expression, angle on the DNA double helix was addressed by changing and the chromatin structure should also be incorporated. the distance between regulatory elements in promoters with For example, a weak binding site may drive higher expres- small increments. of a spacer sequence between sion levels than a strong site if the latter is embedded within TFBS and the TSS yielded an interesting expression pattern a silenced region. The importance of epigenetic context was (Takahashi et al. 1986; Yu et al. 1997), with spacers with recently demonstrated in genome-wide enhancer measure- 5 bp multiplicity (5, 15, 25) having lower expression than ments, which revealed that the same DNA sequence may or spacers with 10 bp multiplicity (10, 20, 30), resulting in a may not act as an enhancer depending on its genomic locus periodic function composed of consecutive peaks whose and epigenetic environment (Arnold et al. 2013). Poly- period is ~10 bp. It was suggested that efficient initiation comb group (PcG) proteins, DNA-methyltransferases and of transcription requires a stereospecific alignment between histone modifiers are recruited to specific target genes to some promoter elements such as the TFBS and the TATA- ensure well-timed and spatially restricted gene-expression box. It appears that the assembly of proteins bound to vari- patterns (Reik 2007; Beisel and Paro 2011). The identifica- ous promoter elements can interact with each other when tion of DNA elements required for the targeting of these located on the same side of the DNA helix. However, while silencing complexes is key to our understanding of epige- some studies support the helical spacing idea, others did not netic dynamic processes. In addition to complexes-medi- reproduce the periodic phenotype. In those cases, increas- ated silencing, DNA sequences have intrinsic preferences ing the distance between a TFBS and the TSS resulted in for nucleosome formation. The DNA segments that are a gradual decrease in expression with no periodic behav- wrapped by nucleosomes are less accessible to most trans- ior (He et al. 2000). Utilizing the high-throughput technol- acting factors such as RNA-polymerase and transcription ogy to systematically investigate the effect of binding site factors. To decipher the effect of nucleosomes on gene location on expression significantly increased the number expression, we should know the DNA sequence prefer- of the tested TFBSs, the number of different promoter con- ences for nucleosome binding, how it affects the accessibil- texts and the resolution of the distance increments (Sharon ity of neighboring cis-regulatory elements such as TF bind- et al. 2012). This higher resolution revealed that even small ing site and TSS, and how this effect depends on distance. 1–7 bp changes in site location could have major effects In the past several decades, it was shown that the affin- on expression. Periodicity in expression was obtained for ity of nucleosomes for different DNA sequences varies one of the designed promoter backgrounds, in which slid- greatly, spanning a range of 5,000-fold between the weak- ing the tested TF site resulted in expression being a ~10 bp est and strongest binding (Gencheva et al. 2006; Thastrom periodic function of site location and the function persisted et al. 2004). This range is thought to reflect differences for six consecutive peaks. Notably, this periodicity was sig- in the ability of DNA sequence to bend around the his- nificant in only one of the two tested promoter contexts for tone octamer into nucleosome structure (Richmond and the same TF, suggesting that some contexts have more flex- Davey 2003). Two sequence features were shown to play ibility in the way in which they can generate stereospecific an opposite role in determining nucleosome affinity. The alignments if these are indeed required. first, which has a positive effect on affinity, consists of These findings leave many open questions regarding the ~10 bp periodicities of specific dinucleotides (Satchwell effect of TFBS location on expression. In which case is ste- et al. 1986; Field et al. 2008). The second feature, which reo alignment required for transcriptional activation? How has a negative effect on nucleosome formation, consists of does it depend on the nature of the TF and the context? Are homopolymeric stretches of deoxyadenosine nucleotides, there cases in which interacting with the DNA helix from referred to as poly(dA:dT) tracts. These disfavoring nucleo- opposite sides is advantageous (e.g., in the case of inde- some sequences were found to be strongly associated with pendent regulators that should not interact with each other nucleosome depletion both in-vivo and in-vitro (reviewed but may sterically occlude each other)? in Segal and Widom 2009), and enrichment of poly(dA:dT) tracts in promoters suggests that they play a regulatory role The effect of epigenetic context and nucleosomes in gene transcription. Indeed, studies of individual genes in yeast showed that poly(dA:dT) tracts stimulate transcrip- In addition to TF binding sites, the transcriptional activity tion by increasing the accessibility of a nearby TF bind- of a gene depends on the local composition and organiza- ing site (Struhl 1985; Iyer and Struhl 1995). A recent study tion of its chromatin environment. Silencing mechanisms in yeast tested the effect of poly(dA:dT) on expression by such as DNA methylation, nucleosome positioning and measuring 70 designed promotes, in which the length, com- histone modifications add epigenetic regulation onto DNA position and distance from TF binding site of poly(dA:dT) without changing the genetic information. Thus, knowing tracts were systematically varied (Raveh-Sadka et al.

1 3 Hum Genet (2014) 133:701–711 709

2012). Notably, manipulating only poly(dA:dT) tracts led are enriched in enhancers and promoters of human and fly to significant effects on gene expression that resemble TF but less so in yeast (Wunderlich and Mirny 2009), suggest- binding site alteration. ing that while mammalian genomes have evolved to tune Advances in technology enabled the design and meas- expression by multiplying binding sites for the same TF, the urements of 777 synthetic promoters dedicated to deci- evolution of yeast may had taken another strategy to regu- phering the effect of nucleosomes on expression (Sharon late expression. Another example is the signal for nucleo- et al. 2012). Adding strong TF binding sites instead of some eviction. While yeast and fly utilize intrinsic physical poly(dA:dT) tracts had a similar effect on expression, sug- properties of AT-rich tracts, mammalian genomes likely rely gesting an alternative mechanism for nucleosome evic- on a different mechanism to achieve the same eviction of tion. In this mechanism, nucleosomes are competed out nucleosomes from their CpG-rich promoters. In addition, by trans-acting factor rather than intrinsic physical prop- different organisms employ various epigenetic strategies to erties of the DNA sequence. Investigation of nucleosome shut down the expression of specific target genes. Some epi- positions in three primary human cell types found that the genetic silencing complexes such as the Sir protein complex positioning signal for nucleosomes is different from the in yeast and Polycomb in fly and mammalian share many 10-bp dinucleotide periodicity observed in other eukaryotes features (Beisel and Paro 2011; Pirrotta and Gross 2005). (Valouev et al. 2011). It is composed of strong G/C cores However, DNA methylation at CpG dinucleotides, which is in the center of the nucleosome (dyad) that are flanked commonly associated with gene silencing in mammalian, is by A/T repelling sequences. Another difference is the GC not a predominant mechanism in fly and yeast (Beisel and content in promoter regions. In contrast to fly and yeast, Paro 2011; Deaton and Bird 2011). Although genomes dif- where AT-rich promoters have intrinsic sequence signals fer in the extent to which they utilize different features of the for nucleosome eviction, human promoters are enriched transcriptional grammar, an intriguing question is whether for CpG islands, intrinsically favorable for nucleosome the behavior of each feature is universal across organisms. formation. However, as in other organisms, promoters of For example, will AT-rich tracts inhibit nucleosome forma- active genes have a nucleosome-free region (NFR) of about tion even in the CpG-rich promoters found in mammalian 150 bp. These observations suggest that CpG-rich segments genomes? Since the physical and biochemical character- in mammalian promoters override intrinsic signals of high istic of TFs and DNA molecules do not change between nucleosome affinity to become active. High-resolution organisms, it is tempting to hypothesize that the principle maps of nucleosome footprints reveal that both CpG and rules governing transcription should be maintained, but this non-CpG promoters are subjected to similar epigenetic reg- awaits further experimentation. As one example, although ulation. These observations support the idea that in addition multiplicity of TFBSs is not a prevalent feature of yeast pro- to intrinsic composition other mechanisms are moters, increasing binding site number led to an increase employed to determine nucleosome positioning in mamma- in expression (Sharon et al. 2012) in a similar fashion as lians (Kelly et al. 2012). observed in mammalian (Yu et al. 1997). The mechanism by which CpG-rich sequences evict nucleosome is still poorly understood. In addition, not much is known about the sequence features that determine Can we solve the grammar of transcription solely higher order chromatin structure and the recruitment of by increasing the throughput of the number silencing complexes such as polycomb. Since higher order of sequences designed and measured? structure has a predominant effect of gene expression in higher eukaryotes, deriving such an understanding should Recent advances in technology provide en efficient mean improve our ability to predict expression from sequence. to design, construct and accurately measure the effect of thousands of regulatory sequences on expression. Fur- ther reductions in DNA synthesis and sequencing costs in Do different organisms follow the same grammatical combination with increased quality and length of the syn- rules? thesized oligos can lead to an era in which experiments will no longer be the major limiting factor. Researchers The letters composing the language of the DNA (A, C, should soon be able to systematically test the effect of vari- G, T) are conserved among all organisms. So is the basic ous elements on gene expression and by that increase our principle of transcription regulation representing a specific understanding of the relative contribution of each regula- interaction between a transcription factor and the DNA tory element to expression. However, it is unclear whether molecule. However, different organisms differ in the set of overcoming this experimental barrier will be sufficient for grammatical rules they “choose” to employ for gene expres- deciphering the grammar of transcriptional regulation. The sion regulation. For example, homotypic clusters of TFBSs main challenge of this approach is its inherent limitation

1 3 710 Hum Genet (2014) 133:701–711 to study regulatory elements that we already know. Many Dostie J, Dekker J (2007) Mapping networks of physical interactions novel insights into transcriptional regulation were achieved between genomic elements using 5C technology. Nat Protoc 2:988–1002 by the discovery of new regulatory mechanisms such as Dunham I et al (2012) An integrated encyclopedia of DNA elements DNA methylation, ncRNA, and nucleosomes. For exam- in the human genome. Nature 489:57–74 ple, the incorporation of nucleosomes and, specifically, Field Y et al (2008) Distinct modes of regulation by chromatin poly(dA:dT) tracts into the computational models improved encoded through nucleosome positioning signals. PLoS Comput Biol 4:e1000216 their performance in predicting expression from sequence FitzGerald PC, Shlyakhtenko A, Mir AA, Vinson C (2004) Clus- (Raveh-Sadka et al. 2012). This discovery of “hidden fac- tering of DNA sequences in human promoters. Genome Res tors” in the DNA sequence is essential to our progress in 14:1562–1574 understanding the grammar of transcription, and such an Gaulton KJ et al (2010) A map of open chromatin in human pancre- atic islets. Nat Genet 42:255–259 understanding cannot be achieved solely by increasing the Gencheva M et al (2006) In Vitro and in Vivo nucleosome position- throughput of sequence measurements. We thus believe that ing on the ovine beta-lactoglobulin gene are related. J Mol Biol successful approaches for deciphering regulatory grammar 361:216–230 should combine powerful synthetic-based methods for dis- Gerstein MB et al (2012) Architecture of the human regulatory net- work derived from ENCODE data. Nature 489:91–100 secting the effect of known regulatory elements as well Gertz J, Siggia ED, Cohen BA (2009) Analysis of combinato- as exploratory unbiased studies that have the potential of rial cis-regulation in synthetic and genomic promoters. Nature uncovering previously unknown regulatory mechanisms. 457:215–218 Giniger E, Ptashne M (1988) Cooperative DNA binding of the Acknowledgments This work was funded by grants to E.S. from yeast transcriptional activator GAL4. Proc Natl Acad Sci USA the National Institute of Health (NIH) and the European Research 85:382–386 Council (ERC). S.WG. is a Clore scholar. Gotea V et al (2010) Homotypic clusters of transcription factor bind- ing sites are a key component of human promoters and enhanc- ers. Genome Res 20:565–577 He X, Hohn T, Futterer J (2000) Transcriptional activation of the rice tungro bacilliform virus gene is critically dependent on an acti- References vator element located immediately upstream of the TATA box. J Biol Chem 275:11799–11808 Arnold CD et al (2013) Genome-wide quantitative enhancer activity Herold M, Bartkuhn M, Renkawitz R (2012) CTCF: insights maps identified by STARR-seq. Science 339:1074–1077 into insulator function during development. Development Badis G et al (2009) Diversity and complexity in DNA recognition by 139:1045–1057 transcription factors. Science 324:1720–1723 Hertel KJ, Lynch KW, Maniatis T (1997) Common themes in the Bakiri L, Matsuo K, Wisniewska M, Wagner EF, Yaniv M (2002) Pro- function of transcription and splicing enhancers. Curr Opin Cell moter specificity and biological activity of tetheredA P-1 dimers. Biol 9:350–357 Mol Cell Biol 22:4952–4964 Iyer V, Struhl K (1995) Poly(dA:dT), a ubiquitous promoter ele- Beer MA, Tavazoie S (2004) Predicting gene expression from ment that stimulates transcription via its intrinsic DNA structure. sequence. Cell 117:185–198 EMBO J 14:2570–2579 Beisel C, Paro R (2011) Silencing chromatin: comparing modes and Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome- mechanisms. Nat Rev Genet 12:123–135 wide mapping of in vivo protein-DNA interactions. Science Berger MF et al (2008) Variation in homeodomain DNA binding 316:1497–1502 revealed by high-resolution analysis of sequence preferences. Jolma A et al (2013) DNA-binding specificities of human transcrip- Cell 133:1266–1276 tion factors. Cell 152:327–339 Blanco J, Girard F, Kamachi Y, Kondoh H, Gehring WJ (2005) Func- Juven-Gershon T, Cheng S, Kadonaga JT (2006) Rational design of a tional analysis of the chicken delta1-crystallin enhancer activ- super core promoter that enhances gene expression. Nat Methods ity in Drosophila reveals remarkable evolutionary conservation 3:917–922 between chicken and fly. Development 132:1895–1905 Kelly TK et al (2012) Genome-wide mapping of nucleosome posi- Burz DS, Rivera-Pomar R, Jackle H, Hanes SD (1998) Cooperative tioning and DNA methylation within individual DNA molecules. DNA-binding by Bicoid provides a mechanism for threshold- Genome Res 22:2497–2506 dependent gene activation in the Drosophila embryo. EMBO J Kheradpour P et al (2013) Systematic dissection of regulatory motifs 17:5998–6009 in 2000 predicted human enhancers using a massively parallel Chiang DY, Nix DA, Shultzaberger RK, Gasch AP, Eisen MB (2006) reporter assay. Genome Res 23:800–811 Flexible promoter architecture requirements for coactivator Khoury AM, Lee HJ, Lillis M, Lu P (1990) Lac repressor-operator recruitment. BMC Mol Biol 7:16 interaction: DNA length dependence. Biochim Biophys Acta Coleman RA, Pugh BF (1995) Evidence for functional binding and 1087:55–60 stable sliding of the TATA binding protein on nonspecific DNA. J Kim JG, Takeda Y, Matthews BW, Anderson WF (1987) Kinetic Biol Chem 270:13850–13859 studies on Cro repressor-operator DNA interaction. J Mol Biol Cox RS 3rd, Surette MG, Elowitz MB (2007) Programming gene 196:149–158 expression with combinatorial promoters. Mol Syst Biol 3:145 Kinkhabwala A, Guet CC (2008) Uncovering cis regulatory codes Crawford GE et al (2006) Genome-wide mapping of DNase hyper- using synthetic promoter shuffling. PLoS One 3:e2030 sensitive sites using massively parallel signature sequencing Kinney JB, Murugan A, Callan CG Jr, Cox EC (2010) Using deep (MPSS). Genome Res 16:123–131 sequencing to characterize the biophysical mechanism of a Deaton AM, Bird A (2011) CpG islands and the regulation of tran- transcriptional regulatory sequence. Proc Natl Acad Sci USA scription. Genes Dev 25:1010–1022 107:9158–9163

1 3 Hum Genet (2014) 133:701–711 711

Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA (2012) Satchwell SC, Drew HR, Travers AA (1986) Sequence periodicities in Complex effects of nucleotide variants in a mammalian cis-regu- chicken nucleosome core DNA. J Mol Biol 191:659–675 latory element. Proc Natl Acad Sci USA 109:19498–19503 Schlabach MR, Hu JK, Li M, Elledge SJ (2010) Synthetic design of Levine M, Tjian R (2003) Transcription regulation and animal diver- strong promoters. Proc Natl Acad Sci USA 107:2538–2543 sity. Nature 424:147–151 Schwartz YB et al (2012) Nature and function of insulator pro- Lidor Nili E et al (2010) p53 binds preferentially to genomic regions tein binding sites in the Drosophila genome. Genome Res with high DNA-encoded nucleosome occupancy. Genome Res 22:2188–2198 20:1361–1368 Segal E, Widom J (2009) Poly(dA:dT) tracts: major determinants of Ligr M, Siddharthan R, Cross FR, Siggia ED (2006) Gene expres- nucleosome organization. Curr Opin Struct Biol 19:65–71 sion from random libraries of yeast promoters. Segal E, Barash Y, Simon I, Friedman N, Koller D (2002) From pro- 172:2113–2122 moter sequence to expression: a probabilistic framework. In: Melnikov A et al (2012) Systematic dissection and optimization of Proceeding 6th international conference on research in computa- inducible enhancers in human cells using a massively parallel tional (RECOMB), Washington, DC reporter assay. Nat Biotechnol 30:271–277 Sharon E et al (2012) Inferring gene regulatory logic from high- Morin R et al (2008) Profiling the HeLa S3 transcriptome using ran- throughput measurements of thousands of systematically domly primed cDNA and massively parallel short-read sequenc- designed promoters. Nat Biotechnol 30:521–530 ing. Biotechniques 45:81–94 Somma MP, Pisano C, Lavia P (1991) The housekeeping promoter Nguyen DH, D’Haeseleer P (2006) Deciphering principles of tran- from the mouse CpG island HTF9 contains multiple protein- scription regulation in eukaryotic genomes. Mol Syst Biol 2:2006 binding elements that are functionally redundant. Nucleic Acids 0012 Res 19:2817–2824 Papatsenko DA et al (2002) Extraction of functional binding sites Struhl K (1985) Naturally occurring poly(dA-dT) sequences are from unique regulatory regions: the Drosophila early develop- upstream promoter elements for constitutive transcription in mental enhancers. Genome Res 12:470–481 yeast. Proc Natl Acad Sci USA 82:8419–8423 Patwardhan RP et al (2009) High-resolution analysis of DNA regula- Takahashi K et al (1986) Requirement of stereospecific alignments tory elements by synthetic saturation mutagenesis. Nat Biotech- for initiation from the simian virus 40 early promoter. Nature nol 27:1173–1175 319:121–126 Patwardhan RP et al (2012) Massively parallel functional dissection Thastrom A, Bingham LM, Widom J (2004) Nucleosomal locations of mammalian enhancers in vivo. Nat Biotechnol 30:265–270 of dominant DNA sequence motifs for histone-DNA interactions Pirrotta V, Gross DS (2005) Epigenetic silencing mechanisms in bud- and nucleosome positioning. J Mol Biol 338:695–709 ding yeast and fruit fly: different paths, same destinations. Mol Valouev A et al (2011) Determinants of nucleosome organization in Cell 18:395–398 primary human cells. Nature 474:516–520 Ravasi T et al (2010) An atlas of combinatorial transcriptional regula- Walhout AJ (2006) Unraveling transcription regulatory networks by tion in mouse and man. Cell 140:744–752 protein-DNA and protein–protein interaction mapping. Genome Raveh-Sadka T et al (2012) Manipulating nucleosome disfavoring Res 16:1445–1454 sequences allows fine-tune regulation of gene expression in yeast. Whitfield TW et al (2012) Functional analysis of transcription factor Nat Genet 44:743–750 binding sites in human promoters. Genome Biol 13:R50 Reece-Hoyes JS et al (2005) A compendium of Caenorhabditis ele- Wunderlich Z, Mirny LA (2009) Different gene regulation strategies gans regulatory transcription factors: a resource for mapping revealed by analysis of binding motifs. Trends Genet 25:434–440 transcription regulatory networks. Genome Biol 6:R110 Xie X et al (2005) Systematic discovery of regulatory motifs in Reed BD, Charos AE, Szekely AM, Weissman SM, Snyder M (2008) human promoters and 3′ UTRs by comparison of several mam- Genome-wide occupancy of SREBP1 and its partners NFY and mals. Nature 434:338–345 SP1 reveals novel functional roles and combinatorial regulation Yu M et al (1997) GA-binding protein-dependent transcription ini- of distinct classes of genes. PLoS Genet 4:e1000133 tiator elements. Effect of helical spacing between polyomavirus Reik W (2007) Stability and flexibility of epigenetic gene regulation enhancer a factor 3(PEA3)/Ets-binding sites on initiator activity. in mammalian development. Nature 447:425–432 J Biol Chem 272:29060–29067 Richmond TJ, Davey CA (2003) The structure of DNA in the nucleo- some core. Nature 423:145–150

1 3