
PRIMER

What are DNA sequence motifs?

Patrik D’haeseleer

Sequence motifs are becoming increasingly important in the analysis of gene regulation. How do we define sequence motifs, and why should we use sequence logos instead of consensus sequences to represent them? Do they have any relation with binding affinity? How do we search for new instances of a motif in this sea of DNA?

Patrik D’haeseleer is in the Microbial Systems Division, Biosciences Directorate, Lawrence Livermore National Laboratory, PO Box 808, L-448, Livermore, California 94551, USA. e-mail: [email protected]

NATURE BIOTECHNOLOGY VOLUME 24 NUMBER 4 APRIL 2006 423–425. © 2006 Nature Publishing Group http://www.nature.com/naturebiotechnology

Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF). Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation) and transcription termination.

In the past, binding sites were typically determined through DNase footprinting, and gel-shift or reporter construct assays, whereas binding affinities to artificial sequences were explored using SELEX. Nowadays, computational methods are generating a flood of putative motifs by searching for overrepresented (and/or conserved) DNA patterns upstream of functionally related genes (for example, genes with similar expression patterns or similar functional annotation). For a while, it seemed like we had more computationally predicted sequence motifs without a known matching transcription factor, than transcription factors without a known binding sequence, although large-scale efforts to analyze the genome-wide binding of transcription factors using ChIP-chip are rapidly rectifying this situation.

The abundance of both computationally and experimentally derived sequence motifs and their growing usefulness in defining genetic regulatory networks and deciphering the regulatory program of individual genes make them important tools for computational biology in the post-genomic era.

Restriction enzymes and consensus sequences

Type II restriction enzymes, discovered in the late 1960s, need to bind to their DNA targets in a highly sequence-specific manner, because they are part of a primitive bacterial immune system designed to chop up viral DNA from infecting phages. Straying from their consensus binding site specificity would be the equivalent of an autoimmune reaction that could lead to irreversible damage to the bacterial genome. For example, EcoRI binds to the 6-mer GAATTC, and only to that sequence. Note that this motif is a palindrome, reflecting the fact that the EcoRI protein binds to the DNA as a homodimer. Other restriction enzymes bind to a degenerate consensus sequence. For example, HindII binds to the sequence GTYRAC, where Y stands for ‘C or T’ (pYrimidine), and R stands for ‘A or G’ (puRine). (See http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html#tab1 for a listing of the IUPAC symbols for degenerate consensus sequences.)

We can calculate how often we would expect these consensus sequences to occur, based on their length and degeneracy. The probability that a random 6-mer matches the EcoRI binding site is $(1/4)^6$, so the site occurs about once every $4^6$ (= 4,096) bp in a random DNA sequence. The HindII binding site, containing two positions where two out of four bases can match, would occur once per $4^4 \times 2^2$ (= 1,024) bp.

Consensus or caricature?

Other DNA binding proteins tend to be less picky in sequence specificity. In 1975, Pribnow discovered the ‘TATAAT box,’ a well-conserved sequence centered around 10 bp upstream of the transcription initiation site of Escherichia coli promoters. This motif, together with a TTGACA motif centered around –35, forms the binding site for the $\sigma^{70}$ subunit of the core RNA polymerase. However, despite the high degree of conservation at each position (ranging from 54% to 82% for each base), it is actually extremely rare to find a promoter that matches this consensus sequence exactly, with most promoters matching only 7–9 out of the 12 bases. Rather than representing a typical binding sequence, the consensus sequence in this case is instead a highly unusual sequence. It turns out that the activity of each promoter is related to how well it matches the consensus sequence, so the activity level of each gene can be fine-tuned by how much its –10 and –35 regions deviate from the consensus.

A better description of the binding sequence in this case is through a Position Frequency Matrix (PFM). Rather than only keeping track of the most common base at each position, we record how often each base occurs in known sites. For example, the Rox1 transcription factor is known to bind at least eight sites in three genes in the Saccharomyces cerevisiae genome. Figure 1 shows the multiple alignment of these eight binding sites, with a consensus sequence of YCHATTGTTCTC. (Conventionally, a single base is shown if it occurs in more than half the sites and at least twice as often as the second most frequent base. Otherwise, a double-degenerate symbol is used if two bases occur in more than 75% of the sites, or a triple-degenerate symbol when one base does not occur at all.) The frequency matrix and its graphic representation in Figure 1 clearly show a core motif of ATTGTT, with much lower conservation in the flanking bases.

Figure 1 ROX1 binding sites and sequence logo. (a) Eight known genomic binding sites in three S. cerevisiae genes. (b) Degenerate consensus sequence. (c,d) Frequencies of nucleotides at each position. (e) Sequence logo showing the frequencies scaled relative to the information content (measure of conservation) at each position. (f) Energy normalized logo using relative entropy to adjust for low GC content in S. cerevisiae. (Bob Crimi)

(a) Binding sites:
    HEM13  CCCATTGTTCTC
    HEM13  TTTCTGGTTCTC
    HEM13  TCAATTGTTTAG
    ANB1   CTCATTGTTGTC
    ANB1   TCCATTGTTCTC
    ANB1   CCTATTGTTCTC
    ANB1   TCCATTGTTCGT
    ROX1   CCAATTGTTTTG

(b) Consensus: YCHATTGTTCTC

(c) Counts at positions 1 to 12:
    A  0 0 2 7 0 0 0 0 0 0 1 0
    C  4 6 4 1 0 0 0 0 0 5 0 5
    G  0 0 0 0 0 1 8 0 0 1 1 2
    T  4 2 2 0 8 7 0 8 8 2 6 1

(d) Stacked letters scaled by counts (vertical axis: Counts, 0.0 to 8.0; horizontal axis: 5' to 3'). (e,f) Sequence logos (vertical axis: Bits, 0.0 to 2.0; horizontal axis: 5' to 3'). Graphics not reproduced.

Sequence logos

By scaling each stack of letters in Figure 1d with some measure of the conservation at each base, we get a much clearer view of the binding sequence. In a ‘sequence logo,’ developed by Schneider and Stephens [1], each stack is scaled with the information content of the base frequencies at that position:

$$I_i = 2 + \sum_b f_{b,i} \log_2 f_{b,i} \qquad (1)$$

where $f_{b,i}$ indicates the frequency of base b at position i. Positions that are perfectly conserved contain 2 bits of information, those where two of the four bases occur 50% of the time each contain 1 bit, and positions where all four bases occur equally often contain no information. Note that for a small sample, the information content will tend to be overestimated, so a small-sample correction needs to be applied. This explains why the central positions of the motif in Figure 1e show an information content of less than 2 bits, even though they are perfectly conserved within the eight known binding sites.

Note that the total information content of a motif is directly related to its expected frequency of occurring within a random DNA sequence. For example, the information content of the partially degenerate 6-mer HindII binding site is 10 bits (2 bits per conserved base, 1 bit per double-degenerate position), and its expected frequency in random DNA is 1 in $2^{10}$ = 1,024.

Correcting for background frequencies

Equation (1) assumes all four bases occur equally often in the background genomic DNA. For organisms such as E. coli (51% GC) or human (41%) this is usually a reasonable approximation. However, for organisms with a more biased GC content such as S. cerevisiae (38%), Caenorhabditis elegans (36%) and especially extremes such as Plasmodium falciparum (19%) or Streptomyces coelicolor (72%), a correction factor is needed. One approach, advocated by Schneider, is to replace the ‘2’ in equation (1) with the lower entropy of random DNA of the specified GC content. A more informative approach [2] is to generalize equation (1) to the relative entropy (a.k.a. Kullback-Leibler distance) of the binding site with respect to the background frequencies:

$$I_{\mathrm{seq}}(i) = \sum_b f_{b,i} \log_2 \frac{f_{b,i}}{p_b} \qquad (2)$$

where $p_b$ is the background frequency of base b in the genome. With $p_b = 1/4$ for all four bases, equation (2) reduces to equation (1). This is equivalent to a log-likelihood ratio (G test) to measure the degree of disagreement between the observed and background base frequencies, and thus can again be used to calculate the significance of the motif itself (and the frequency of occurrence of such a sequence in random DNA).

Figure 1f shows the Rox1 binding motif, corrected for the GC content of S. cerevisiae genomic DNA using equation (2). In comparison with Figure 1e, the central G base now carries more information than the flanking A and T bases, reflecting the fact that its occurrence is much more significant in the low-GC genome. The total information content of the motif is $I_{\mathrm{seq}}$ = 11.27 bits.

Roll your own logos

Two free web interfaces exist for generating a sequence logo from your favorite DNA alignment: Steven Brenner’s WebLogo (http://weblogo.berkeley.edu/), implementing Schneider’s original Sequence Logos, and the more recent enoLOGOS [3] (http://biodev.hgen.pitt.edu/enologos), using relative entropy. The former provides an option to put error bars on the information content, which can be quite useful especially for motifs based on a small number of sequences. However, the latter offers a wider variety of input formats, variable GC content, and the option to examine nonindependent bases via mutual information. The two sites also take a different approach to small-sample correction. The logos in Figure 1 were generated using enoLOGOS.

Transcription factor binding sites are collected in a number of online databases, including TRANSFAC [4] (http://www.gene-regulation.com/pub/databases.html), JASPAR [5] for multicellular eukaryotes (http://jaspar.genereg.net), YEASTRACT [6] (http://www.yeastract.com/) and SCPD [7] (http://rulai.cshl.edu/SCPD) for S. cerevisiae, RegulonDB [8] for E. coli (http://regulondb.ccg.unam.mx) and PRODORIC [9] for prokaryotes (http://www.prodoric.de/), although some of these are still focused primarily on consensus sequences.

Binding energy and searching for novel sites

As mentioned above, the affinity of a DNA binding protein to a specific binding site is typically correlated with how well the site matches the consensus sequence. However, not all positions in a binding site are equally forgiving of mismatches, and not all mismatches at a given position have the same effect.

If we assume that each position contributes to the binding energy independently (a reasonable approximation in most cases), we could laboriously measure the effect on binding energy of all possible single base changes. The resulting position weight matrix (PWM) $W(b,i)$ can then be used to calculate the specific-binding free energy (relative to random background DNA) of a sequence S as:

$$-\Delta G_s(S) = \sum_i W(S(i), i) \qquad (3)$$

where $S(i)$ is the base occurring in position i in sequence S.

Typically, we only have a list of known binding sites, without any affinity information. If we assume that the genomic DNA is random with base frequencies $p_b$, it is possible to optimize the values in the PWM such that the probability of binding to the known binding sites (versus the more abundant background DNA) is maximized. The optimal weight matrix is then given by:

$$W(b,i) = \log_2 \frac{f_{b,i}}{p_b} \qquad (4)$$

The information content $I_{\mathrm{seq}}$ can then be interpreted as an estimate of the average specific binding energy to the entire set of known binding sites, in competition with the genomic DNA.

This PWM can be used to search for novel sites with high predicted binding affinity within the rest of the genome, typically using a score threshold based on the scores of the known binding sites. Unfortunately, this approach can result in large numbers of false positives, sometimes returning hundreds or thousands of putative binding sites. A promising alternative developed by Djordjevic et al. [10] is to simultaneously optimize the weight matrix as well as the threshold such that all the known sites are included, but as few other sites as possible, typically resulting in many fewer and more reliable novel sites.

Nevertheless, it is unavoidable that sequence motifs with low information content (that is, short motifs, and/or with a lot of degeneracy) will tend to yield large numbers of fairly low affinity hits, especially in large eukaryotic genomes. Presumably other factors such as chromatin structure and cooperative binding play a role as well in determining the in vivo specificity of the associated transcription factors.

1. Schneider, T.D. & Stephens, R.M. Sequence Logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).
2. Stormo, G.D. DNA binding sites: representation and discovery. Bioinformatics 16, 16–23 (2000).
3. Workman, C.T. et al. EnoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 33 (Web Server Issue), W389–W392 (2005).
4. Matys, V. et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34 suppl. Database issue, D108–D110 (2006).
5. Vlieghe, D. et al. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 34 suppl. Database issue, D95–D97 (2006).
6. Teixeira, M.C. et al. The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res. 34 suppl. Database issue, D446–D451 (2006).
7. Zhu, J. & Zhang, M.Q. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 15, 607–611 (1999).
8. Salgado, H. et al. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 34 suppl. Database issue, D394–D397 (2006).
9. Munch, R. et al. PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Res. 31, 266–269 (2003).
10. Djordjevic, M., Sengupta, A.M. & Shraiman, B.I. A biophysical approach to transcription factor binding site discovery. Genome Res. 13, 2381–2390 (2003).
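The expected-frequency arithmetic for the EcoRI and HindII sites discussed earlier can be checked with a few lines of Python. This is only a sketch under the same assumption the text makes (random DNA with all four bases equally likely); the function names and the `IUPAC_DEGENERACY` table are illustrative choices, not part of any published tool.

```python
from fractions import Fraction
from math import log2

# Number of bases (out of A, C, G, T) matched by each IUPAC symbol.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus):
    """Probability that a random site of the same length matches the
    consensus, assuming independent positions and equal base frequencies."""
    probability = Fraction(1)
    for symbol in consensus.upper():
        probability *= Fraction(IUPAC_DEGENERACY[symbol], 4)
    return probability


def expected_spacing(consensus):
    """Average number of bp between matches in random DNA (1 / probability)."""
    return 1 / match_probability(consensus)


def information_bits(consensus):
    """Information content in bits: 2 per fixed base, 1 per two-fold
    degenerate position, and so on."""
    return -log2(float(match_probability(consensus)))


if __name__ == "__main__":
    for name, site in [("EcoRI", "GAATTC"), ("HindII", "GTYRAC")]:
        print(name, site, expected_spacing(site), information_bits(site))
```

Exact fractions are used so that the spacing for sites with three-fold degenerate symbols (such as H) is not rounded prematurely.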

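As a hedged illustration of equations (1) and (2), the sketch below builds the position frequency matrix from the eight Rox1 sites listed in Figure 1a and computes the per-position information content, first against a uniform background and then against a 38% GC background. It applies no small-sample correction, so its totals should not be expected to reproduce the 11.27 bits quoted in the text; the helper names and the way the background is derived from a single GC fraction are assumptions made for this example.

```python
from math import log2

BASES = "ACGT"

# The eight Rox1 binding sites from Figure 1a.
ROX1_SITES = [
    "CCCATTGTTCTC", "TTTCTGGTTCTC", "TCAATTGTTTAG", "CTCATTGTTGTC",
    "TCCATTGTTCTC", "CCTATTGTTCTC", "TCCATTGTTCGT", "CCAATTGTTTTG",
]


def count_matrix(sites):
    """Count how often each base occurs at each aligned position."""
    width = len(sites[0])
    return [{b: sum(site[i] == b for site in sites) for b in BASES}
            for i in range(width)]


def frequencies(counts):
    """Convert per-position counts into per-position base frequencies."""
    result = []
    for column in counts:
        total = sum(column.values())
        result.append({b: column[b] / total for b in BASES})
    return result


def information_content(freqs):
    """Equation (1): I_i = 2 + sum_b f log2 f, with 0 log 0 taken as 0."""
    return [2 + sum(f * log2(f) for f in column.values() if f > 0)
            for column in freqs]


def relative_entropy(freqs, background):
    """Equation (2): I_seq(i) = sum_b f log2(f / p_b)."""
    return [sum(f * log2(f / background[b]) for b, f in column.items() if f > 0)
            for column in freqs]


def background_from_gc(gc_fraction):
    """Assumed background model: G = C and A = T, set by the GC fraction."""
    return {"A": (1 - gc_fraction) / 2, "C": gc_fraction / 2,
            "G": gc_fraction / 2, "T": (1 - gc_fraction) / 2}


COUNTS = count_matrix(ROX1_SITES)
FREQS = frequencies(COUNTS)
INFO_UNIFORM = information_content(FREQS)
INFO_YEAST = relative_entropy(FREQS, background_from_gc(0.38))

if __name__ == "__main__":
    for i, (u, y) in enumerate(zip(INFO_UNIFORM, INFO_YEAST), start=1):
        print(f"position {i:2d}: {u:.2f} bits (uniform), {y:.2f} bits (38% GC)")
```

With the low-GC background, the fully conserved G at position 7 scores higher than the fully conserved T positions, which is the effect the text describes for Figure 1f.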
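Equations (3) and (4) and the threshold-based search can be sketched in the same way. The code below is a minimal example, not the method of any particular tool: it adds a pseudocount (an assumption introduced here so that unobserved bases do not give a weight of minus infinity), uses a 38% GC background, scans only the forward strand, and expects uppercase A, C, G and T only. All names and the toy sequence are hypothetical.

```python
from math import log2

BASES = "ACGT"

# The eight Rox1 binding sites from Figure 1a.
ROX1_SITES = [
    "CCCATTGTTCTC", "TTTCTGGTTCTC", "TCAATTGTTTAG", "CTCATTGTTGTC",
    "TCCATTGTTCTC", "CCTATTGTTCTC", "TCCATTGTTCGT", "CCAATTGTTTTG",
]

# Assumed background for a 38% GC genome (A = T = 0.31, C = G = 0.19).
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}


def build_pwm(sites, background, pseudocount=1.0):
    """Equation (4): W(b,i) = log2(f_{b,i} / p_b).

    The pseudocount is an illustrative assumption, spread over the bases in
    proportion to the background, to keep every frequency above zero.
    """
    n_sites = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for b in BASES:
            count = sum(site[i] == b for site in sites)
            freq = (count + pseudocount * background[b]) / (n_sites + pseudocount)
            column[b] = log2(freq / background[b])
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Equation (3): sum of W(S(i), i) over the positions of the site."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif width")
    return sum(column[base] for column, base in zip(pwm, site))


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every forward-strand window whose
    score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = score(pwm, window)
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits


PWM = build_pwm(ROX1_SITES, BACKGROUND)
# Threshold taken as the lowest score among the known sites.
THRESHOLD = min(score(PWM, site) for site in ROX1_SITES)

if __name__ == "__main__":
    toy_sequence = "ACGTACGT" + "TCCATTGTTCTC" + "GGGCGCGC"
    for start, window, window_score in scan(PWM, toy_sequence, THRESHOLD):
        print(start, window, round(window_score, 2))
```

Setting the threshold at the weakest known site guarantees that every known site is recovered, which is also why this simple approach tends to admit many false positives on a real genome.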