What Are DNA Sequence Motifs?

PRIMER

Patrik D’haeseleer

Nature Biotechnology, Volume 24, Number 4, April 2006. © 2006 Nature Publishing Group, http://www.nature.com/naturebiotechnology

Sequence motifs are becoming increasingly important in the analysis of gene regulation. How do we define sequence motifs, and why should we use sequence logos instead of consensus sequences to represent them? Do they have any relation with binding affinity? How do we search for new instances of a motif in this sea of DNA?

Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF). Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation) and transcription termination.

In the past, binding sites were typically determined through DNase footprinting, and gel-shift or reporter construct assays, whereas binding affinities to artificial sequences were explored using SELEX. Nowadays, computational methods are generating a flood of putative regulatory sequence motifs by searching for overrepresented (and/or conserved) DNA patterns upstream of functionally related genes (for example, genes with similar expression patterns or similar functional annotation). For a while, it seemed like we had more computationally predicted sequence motifs without a known matching transcription factor than transcription factors without a known binding sequence, although large-scale efforts to analyze the genome-wide binding of transcription factors using ChIP-chip are rapidly rectifying this situation.

The abundance of both computationally and experimentally derived sequence motifs and their growing usefulness in defining genetic regulatory networks and deciphering the regulatory program of individual genes make them important tools for computational biology in the post-genomic era.

Patrik D’haeseleer is in the Microbial Systems Division, Biosciences Directorate, Lawrence Livermore National Laboratory, PO Box 808, L-448, Livermore, California 94551, USA. e-mail: [email protected]

Restriction enzymes and consensus sequences

Type II restriction enzymes, discovered in the late 1960s, need to bind to their DNA targets in a highly sequence-specific manner, because they are part of a primitive bacterial immune system designed to chop up viral DNA from infecting phages. Straying from their consensus binding site specificity would be the equivalent of an autoimmune reaction that could lead to irreversible damage to the bacterial genome. For example, EcoRI binds to the 6-mer GAATTC, and only to that sequence. Note that this motif is a palindrome, reflecting the fact that the EcoRI protein binds to the DNA as a homodimer.

Other restriction enzymes bind to a degenerate consensus sequence. For example, HindII binds to the sequence GTYRAC, where Y stands for ‘C or T’ (pYrimidine), and R stands for ‘A or G’ (puRine). (See http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html#tab1 for a listing of the IUPAC symbols for degenerate consensus sequences.)

We can calculate how often we would expect these consensus sequences to occur, based on their length and degeneracy. The probability that a random 6-mer matches the EcoRI binding site is $(1/4)^6$, so the site occurs about once every $4^6$ (= 4,096) bp in a random DNA sequence. The HindII binding site, containing two positions where two out of four bases can match, would occur once per $4^4 \times 2^2$ (= 1,024) bp.

Consensus or caricature?

Other DNA binding proteins tend to be less picky in sequence specificity. In 1975, Pribnow discovered the ‘TATAAT box,’ a well-conserved sequence centered around 10 bp upstream of the transcription initiation site of Escherichia coli promoters. This motif, together with a TTGACA motif centered around –35, forms the binding site for the σ⁷⁰ subunit of the core RNA polymerase. However, despite the high degree of conservation at each position (ranging from 54% to 82% for each base), it is actually extremely rare to find a promoter that matches this consensus sequence exactly, with most promoters matching only 7–9 out of the 12 bases. Rather than representing a typical binding sequence, the consensus sequence in this case is instead a highly unusual sequence. It turns out that the activity of each promoter is related to how well it matches the consensus sequence, so the activity level of each gene can be fine-tuned by how much its –10 and –35 regions deviate from the consensus.

A better description of the binding sequence in this case is through a Position Frequency Matrix (PFM). Rather than only keeping track of the most common base at each position, we record how often each base occurs in known sites. For example, the Rox1 transcription factor is known to bind at least eight sites in three genes in the Saccharomyces cerevisiae genome. Figure 1 shows the multiple alignment of these eight binding sites, with a consensus sequence of YCHATTGTTCTC. (Conventionally, a single base is shown if it occurs in more than half the sites and at least twice as often as the second most frequent base. Otherwise, a double-degenerate symbol is used if two bases occur in more than 75% of the sites, or a triple-degenerate symbol when one base does not occur at all.) The frequency matrix and its graphic representation in Figure 1 clearly show a core motif of ATTGTT, with much lower conservation in the flanking bases.

Figure 1 ROX1 binding sites and sequence motif. (a) Eight known genomic binding sites in three S. cerevisiae genes. (b) Degenerate consensus sequence. (c,d) Frequencies of nucleotides at each position. (e) Sequence logo showing the frequencies scaled relative to the information content (measure of conservation) at each position. (f) Energy normalized logo using relative entropy to adjust for low GC content in S. cerevisiae. Figure credit: Bob Crimi.

    (a) HEM13  CCCATTGTTCTC
        HEM13  TTTCTGGTTCTC
        HEM13  TCAATTGTTTAG
        ANB1   CTCATTGTTGTC
        ANB1   TCCATTGTTCTC
        ANB1   CCTATTGTTCTC
        ANB1   TCCATTGTTCGT
        ROX1   CCAATTGTTTTG

    (b)        YCHATTGTTCTC

    (c) A      0 0 2 7 0 0 0 0 0 0 1 0
        C      4 6 4 1 0 0 0 0 0 5 0 5
        G      0 0 0 0 0 1 8 0 0 1 1 2
        T      4 2 2 0 8 7 0 8 8 2 6 1

    [Panels d, e and f are images and are not reproduced here. Panel d y-axis: Counts (0.0 to 8.0); panels e and f y-axis: Bits (0.0 to 2.0); all x-axes run 5' to 3'.]

Sequence logos

By scaling each stack of letters in Figure 1d with some measure of the conservation at each base, we get a much clearer view of the binding sequence. [...]

[...] to be applied. This explains why the central positions of the motif in Figure 1e show an information content of less than 2 bits, even though they are perfectly conserved within the eight known binding sites.

Note that the total information content of a motif is directly related to its expected frequency of occurring within a random DNA sequence. For example, the information content of the partially degenerate 6-mer HindII binding site is 10 bits (2 bits per conserved base, 1 bit per double-degenerate position), and its expected frequency in random DNA is 1 in $2^{10}$ = 1,024.

Correcting for background frequencies

Equation (1) assumes all four bases occur equally often in the background genomic DNA. For organisms such as E. coli (51% GC) or human (41%) this is usually a reasonable approximation. However, for genomes with a more biased GC content such as S. cerevisiae (38%), Caenorhabditis elegans (36%) and especially extremes such as Plasmodium falciparum (19%) or Streptomyces coelicolor (72%), a correction factor is needed. One approach, advocated by Schneider, is to replace the ‘2’ in equation (1) with the lower entropy of random DNA of the specified GC content. A more informative approach² is to generalize equation (1) to the relative entropy (a.k.a. Kullback-Leibler distance) of the binding site with respect to the background frequencies:

$$I_{seq}(i) = \sum_b f_{b,i} \log_2 \frac{f_{b,i}}{p_b} \qquad (2)$$

where $p_b$ is the background frequency of base $b$ in the genome. This is equivalent to a log-likelihood ratio (G test) to measure the degree of disagreement between the [...]

[...] alignment: Steven Brenner’s WebLogo (http://weblogo.berkeley.edu/), implementing Schneider’s original Sequence Logos, and the more recent enoLOGOS³ (http://biodev.hgen.pitt.edu/enologos), using relative entropy. The former provides an option to put error bars on the information content, which can be quite useful especially for motifs based on a small number of sequences. However, the latter offers a wider variety of input formats, variable GC content, and the option to examine nonindependent bases via mutual information. The two sites also take a different approach to small-sample correction. The logos in Figure 1 were generated using enoLOGOS.

Transcription factor binding sites are collected in a number of online databases, including TRANSFAC⁴ (http://www.gene-regulation.com/pub/databases.html), JASPAR⁵ for multicellular eukaryotes (http://jaspar.genereg.net), YEASTRACT⁶ (http://www.yeastract.com/) and SCPD⁷ (http://rulai.cshl.edu/SCPD) for S. cerevisiae, RegulonDB⁸ for E. coli (http://regulondb.ccg.unam.mx) and PRODORIC⁹ for prokaryotes (http://www.prodoric.de/), although some of these are still focused primarily on consensus sequences.

Binding energy and searching for novel sites

As mentioned above, the affinity of a DNA binding protein to a specific binding site is typically correlated with how well the site matches the consensus sequence. However, not all positions in a binding site are equally forgiving of mismatches, and not all mismatches at a given position have the same effect.

If we assume that each position contributes to the binding energy independently (a reasonable approximation in most cases), we [...]
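The expected-frequency arithmetic for the EcoRI and HindII consensus sequences can be checked with a few lines of Python. This is only a minimal sketch: the function names `match_probability` and `expected_spacing` are invented for illustration, and the calculation assumes random DNA in which all four bases are equally likely and positions are independent.

```python
from fractions import Fraction

# IUPAC nucleotide codes mapped to the bases each symbol can match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_probability(consensus):
    """Probability that a random site matches the consensus.

    Assumes uniform base frequencies (1/4 each) and independent positions.
    """
    p = Fraction(1)
    for symbol in consensus:
        p *= Fraction(len(IUPAC[symbol]), 4)
    return p

def expected_spacing(consensus):
    """Average number of bp per match in random DNA (1 / probability)."""
    return 1 / match_probability(consensus)

print(expected_spacing("GAATTC"))  # EcoRI site: once per 4^6 bp
print(expected_spacing("GTYRAC"))  # HindII site: once per 4^4 x 2^2 bp
```

Exact fractions are used instead of floats so that the results compare cleanly with the integers quoted in the text.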
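The position frequency matrix and the degenerate consensus for the eight Rox1 sites of Figure 1 can be derived directly from the alignment. The sketch below is one possible reading of the convention quoted in the text (single base, then double-degenerate, then triple-degenerate symbol); the helper names and the fallback to `N` when no rule applies are assumptions made for illustration, not part of the original article.

```python
# The eight Rox1 binding sites listed in Figure 1a.
SITES = [
    "CCCATTGTTCTC",  # HEM13
    "TTTCTGGTTCTC",  # HEM13
    "TCAATTGTTTAG",  # HEM13
    "CTCATTGTTGTC",  # ANB1
    "TCCATTGTTCTC",  # ANB1
    "CCTATTGTTCTC",  # ANB1
    "TCCATTGTTCGT",  # ANB1
    "CCAATTGTTTTG",  # ROX1
]
BASES = "ACGT"

# Degenerate IUPAC symbols keyed by the alphabetically sorted bases they cover.
DEGENERATE = {
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
}

def position_frequency_matrix(sites):
    """Return {base: [count at each position]} for equal-length sites."""
    length = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) for i in range(length)]
            for b in BASES}

def consensus_symbol(counts):
    """Apply the convention described in the text to one column of counts."""
    total = sum(counts.values())
    ranked = sorted(BASES, key=lambda b: counts[b], reverse=True)
    first, second = counts[ranked[0]], counts[ranked[1]]
    # Single base: more than half the sites and at least twice the runner-up.
    if first > total / 2 and first >= 2 * second:
        return ranked[0]
    # Double-degenerate: the two most frequent bases cover more than 75%.
    if first + second > 0.75 * total:
        return DEGENERATE["".join(sorted(ranked[:2]))]
    # Triple-degenerate: exactly one base never occurs.
    present = [b for b in BASES if counts[b] > 0]
    if len(present) == 3:
        return DEGENERATE["".join(present)]
    return "N"  # assumed fallback when none of the rules applies

pfm = position_frequency_matrix(SITES)
consensus = "".join(
    consensus_symbol({b: pfm[b][i] for b in BASES})
    for i in range(len(SITES[0]))
)

for base in BASES:
    print(base, "".join(str(c) for c in pfm[base]))
print(consensus)
```

With these eight sites the printed rows reproduce the count matrix of Figure 1c and the consensus of Figure 1b.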
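The relative-entropy form of the information content can be illustrated with a short calculation. This is a hedged sketch only: `background_from_gc` and `information_content` are hypothetical helper names, the background model assumes G = C and A = T within a genome, degenerate HindII positions are split evenly between their two bases, and no small-sample correction is applied.

```python
import math

def background_from_gc(gc):
    """Background base frequencies for a given GC fraction.

    Assumes G and C are equally frequent, and likewise A and T.
    """
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

def information_content(freqs, background):
    """Relative entropy (in bits) of one motif position versus the background.

    Bases with zero frequency contribute nothing (0 * log 0 is taken as 0).
    """
    return sum(f * math.log2(f / background[b])
               for b, f in freqs.items() if f > 0)

uniform = background_from_gc(0.50)
yeast = background_from_gc(0.38)  # S. cerevisiae GC content quoted in the text

# HindII site GTYRAC as per-position base frequencies.
hindii = [
    {"G": 1.0}, {"T": 1.0}, {"C": 0.5, "T": 0.5},
    {"A": 0.5, "G": 0.5}, {"A": 1.0}, {"C": 1.0},
]
total_bits = sum(information_content(col, uniform) for col in hindii)
print(total_bits)  # 2 bits per conserved base, 1 bit per degenerate position

# A perfectly conserved T scores differently against the two backgrounds.
conserved_t = {"T": 1.0}
print(information_content(conserved_t, uniform))
print(information_content(conserved_t, yeast))
```

Under the uniform background the HindII total comes to the 10 bits quoted in the text. Against the AT-rich yeast background a conserved T is less surprising, so it scores below 2 bits, while a conserved G or C would score above 2 bits.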
