PRIMER

How does DNA sequence motif discovery work?

Patrik D'haeseleer

Patrik D'haeseleer is in the Microbial Systems Division, Biosciences Directorate, Lawrence Livermore National Laboratory, 7000 East Ave., PO Box 808, L-448, Livermore, California 94551, USA. e-mail: [email protected]

How can we computationally extract an unknown motif from a set of target sequences? What are the principles behind the major motif discovery algorithms? Which of these should we use, and how do we know we've found a 'real' motif?

Extracting regulatory motifs1 from DNA sequences seems to be all the rage these days. Take your favorite cluster of coexpressed genes, and with some luck you might hope to find a short pattern of nucleotides upstream of the transcription start sites of these genes, indicating a common transcription factor binding site responsible for their coordinate regulation. Easier said than done—the hunt for such a common motif may be like searching for the proverbial needle in a haystack. Consider the complexity of searching for imperfect copies of an unknown pattern, perhaps as small as 6–8 base pairs, occurring potentially thousands of bases upstream of some unknown subset of our genes of interest.

Let us first consider the problem in its most basic form. Given a set of sequences, which we have good reason to believe share a common binding motif, how do we go about extracting these often degenerate patterns from the set of sequences? Motif discovery algorithms subdivide into three distinct approaches: enumeration, deterministic optimization and probabilistic optimization.

Enumeration
Enumerative algorithms exhaustively cover the space of all possible motifs, for a specific motif model description. For example, dictionary-based methods count the number of occurrences of all n-mers in the target sequences, and calculate which ones are most overrepresented. A motif description based on exact occurrence of specific words is too rigid for most real-world binding sites, but a number of similar overrepresented words may be combined into a more flexible motif description. Alternatively, one can search the space of all degenerate consensus sequences up to a given length, for example, using IUPAC codes for 2-nucleotide or 3-nucleotide degenerate positions in the motif2. Another enumerative approach describes a motif as a consensus sequence and an allowed number of mismatches, and uses an efficient suffix tree representation to find all such motifs in the target sequences3.

Enumerative methods cover the entire search space, and therefore do not run the risk of getting stuck in a local optimum. On the other hand, the abstractions needed to achieve an enumerable search space may overlook some of the subtle patterns present in real binding sites.
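To make the dictionary-based flavor of enumeration concrete, here is a minimal sketch in Python that counts every k-mer in a few toy target sequences and ranks the words by a crude overrepresentation score against a uniform background. The toy sequences, the uniform background and the z-like score are simplifying assumptions for illustration only; published enumerative tools such as YMF2 use more careful background models and statistics, and Weeder3 allows mismatches via a suffix tree rather than exact words.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count occurrences of every k-mer across all target sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented_kmers(sequences, k, top=5):
    """Rank k-mers by a crude z-like score, (observed - expected) / sqrt(expected),
    where the expectation assumes a uniform background over the 4**k possible words."""
    counts = count_kmers(sequences, k)
    n_windows = sum(max(len(s) - k + 1, 0) for s in sequences)
    expected = n_windows / 4 ** k
    scored = sorted(((word, (obs - expected) / expected ** 0.5)
                     for word, obs in counts.items()),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:top]

if __name__ == "__main__":
    # Toy upstream sequences (hypothetical data, not from any real promoter set).
    targets = ["AATCAGTTATCTGTTGTATACCCGGAGTCC",
               "AGGTCGAATGCAAAACGGTTCTTGCACGTA",
               "GAGATAACCGCTTGATATGACTCATTTGCC"]
    for word, score in overrepresented_kmers(targets, k=6):
        print(word, round(score, 2))
```

Grouping similar high-scoring words, or allowing degenerate IUPAC positions and mismatches, is what turns these rigid exact words into the more flexible motif descriptions discussed above.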
Deterministic optimization
Expectation maximization (EM) can be used to simultaneously optimize a position weight matrix (PWM) description of a motif1, and the binding probabilities for its associated sites (Fig. 1). The weight matrix for the motif is initialized with a single n-mer subsequence, plus a small amount of background nucleotide frequencies. Next, for each n-mer in the target sequences, we calculate the probability that it was generated by the motif, rather than by the background sequence distribution. Expectation maximization then takes a weighted average across these probabilities to generate a more refined motif model. The algorithm iterates between calculating the probability of each site based on the current motif model, and calculating a new motif model based on the probabilities. It can be shown that this procedure performs a gradient ascent, converging to a maximum of the log likelihood of the resulting model.

[Figure 1 (panels 'Sites in target sequences' and 'Motif model' not reproduced here). Starting from a single site, expectation maximization algorithms such as MEME4 alternate between assigning sites to a motif (left) and updating the motif model (right). Note that only the best hit per sequence is shown here, although lesser hits in the same sequence can have an effect as well.]

One popular implementation of the expectation maximization algorithm, MEME4, performs a single iteration for each n-mer in the target sequences, selects the best motif from this set and then iterates only that one to convergence, avoiding local maxima. This partially enumerative nature of MEME provides some assurance that the algorithm is unlikely to get stuck in a poor local maximum. Additional motifs present in the set of target sequences can be found by masking the sequences matched by the first motif and rerunning the algorithm.
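To make the alternation in Figure 1 concrete, here is a minimal sketch of a single EM iteration over a fixed-width position weight matrix, assuming a uniform background and a fixed prior probability that any given window is a motif occurrence. It illustrates the E-step/M-step idea rather than reproducing MEME4, whose per-sequence occurrence models, width optimization and partially enumerative seeding are described above; the constants BG_PROB and MOTIF_PRIOR are assumptions made for this sketch.

```python
import numpy as np

BASES = "ACGT"
BG_PROB = 0.25      # uniform background, a simplification of real background models
MOTIF_PRIOR = 0.05  # assumed prior probability that a window is a motif occurrence

def seed_pwm(nmer, weight=0.7):
    """Initialize the PWM from a single n-mer, mixed with background frequencies."""
    pwm = np.full((len(nmer), 4), (1 - weight) / 4)
    for j, base in enumerate(nmer):
        pwm[j, BASES.index(base)] += weight
    return pwm

def window_prob(pwm, window):
    """P(window | motif): product of the per-position PWM probabilities."""
    return float(np.prod([pwm[j, BASES.index(b)] for j, b in enumerate(window)]))

def em_iteration(pwm, sequences):
    """One EM pass.
    E-step: for every window, estimate the posterior probability that it was
    generated by the motif rather than the background.
    M-step: re-estimate the PWM as the posterior-weighted average of all windows."""
    width = pwm.shape[0]
    counts = np.full((width, 4), 1e-3)            # tiny pseudocount for numerical stability
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            p_motif = window_prob(pwm, window) * MOTIF_PRIOR
            p_bg = (BG_PROB ** width) * (1 - MOTIF_PRIOR)
            weight = p_motif / (p_motif + p_bg)   # posterior P(motif | window)
            for j, base in enumerate(window):
                counts[j, BASES.index(base)] += weight
    return counts / counts.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    # Toy sequences and seed word, for illustration only.
    targets = ["AATCAGTTATCTGTTGTATACCCGGAGTCC", "AGGTCGAATGCAAAACGGTTCTTGCACGTA"]
    pwm = seed_pwm("CAGTTA")
    for _ in range(20):                           # iterate toward convergence
        pwm = em_iteration(pwm, targets)
    print(np.round(pwm, 2))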
Probabilistic optimization
Gibbs sampling can be viewed as a stochastic implementation of expectation maximization. Whereas the latter takes a weighted average across all subsequences (weighted with the current estimate of the probability that they belong to the motif), Gibbs sampling takes a weighted sample from these subsequences. The motif model is typically initialized with a randomly selected set of sites, and every site in the target sequences is scored against this initial motif model. At each iteration, the algorithm probabilistically decides whether to add a new site and/or remove an old site from the motif model, weighted by the binding probability for those sites. The resulting motif model is then updated, and the binding probabilities recalculated.

Interestingly, the term 'Gibbs sampling' was borrowed by analogy from statistical mechanics (via simulated annealing), where it refers to the inverse exponential relationship between the energy of a microstate and its probability, p_m = C·e^(−βE_m). Because the weight matrix score of a site is actually related to the estimated binding energy of protein binding to that site1, the analogy is particularly apt in this case.

Which one?
Which methods should we use, out of this panoply of choices? In a recent large-scale comparison between 13 different motif discovery algorithms by Tompa et al.5,6, enumerative approaches such as Weeder3 and YMF2 performed surprisingly well on a set of eukaryotic sequences with known motifs. (A complementary comparison on Escherichia coli sequences was performed by Hu, Li & Kihara7.) However, each algorithm typically covers only a small subset of the known binding sites, with relatively little overlap between the algorithms. It is therefore advisable to combine the results from multiple motif discovery tools, ideally covering a range of motif descriptions and search algorithms. For example, MotifSampler8 (a Gibbs sampling implementation using a higher order Markov background model) was found to be complementary to a number of other, non-Gibbs, methods, including MEME4, YMF2 and Weeder3.

Note that the implementation details of the algorithm—how the motifs are represented, whether the motif width and number of occurrences can be optimized, which objective function is being optimized—may be more important in practice than which optimization engine (EM or Gibbs sampling) is being used.

So let's say we've run a number of different algorithms, possibly returning multiple motifs from each. How do we decide which of these many motifs are biologically relevant? Nearly every algorithm uses a different evaluation criterion to optimize or score motifs, although often some variant of a log likelihood score is used:

Information content, log likelihood and MAP score
Information content and relative entropy (see the earlier Primer on DNA sequence motifs1) measure how much a motif deviates from a background distribution of nucleotide frequencies. However, information content by itself does not account for the number of binding sites in the target sequences.

Likelihood of a model refers to the probability that the observed data could have been generated by the model in question. Typically, one optimizes the logarithm of this probability (hence 'log likelihood') with respect to the parameters of the model:

log L(model | data) = log Pr(data | model) = Σ_i log Pr(data_i | model)

Log likelihood allows for more sophisticated background models (typically 3rd or higher order Markov models), making it easier to rule out low-complexity repeats such as poly-A or poly-T sequences, which would otherwise show up as high information content motifs. It also takes the number of binding sites into account, not just how well they are conserved. In fact, the log likelihood is roughly proportional to the information content of the motif times the number of binding sites.

The maximum a posteriori probability (MAP) estimate of a model is the one that maximizes the probability of the model given the data. If all models are a priori equally likely, this is identical to the maximum likelihood estimate above. However, if we have some prior information regarding the motifs (for example, a total number of expected sites in the target sequences), we can use Bayes' rule to bias the solution accordingly:

Pr(model | data) = Pr(data | model) · Pr(model) / Pr(data)

log Pr(model | data) ∝ log Pr(data | model) + log Pr(model)

The MAP score therefore includes the log-likelihood score, plus it also allows incorporation of additional value judgments of the quality of the motif.

Although these various measures are widely used and have been very successful at discovering novel motifs, in practice none of them turns out to be sufficient to reliably distinguish known binding sites from spurious ones by themselves6,7,9. Therefore it is important not just to take the output of these algorithms at face value. (For some general guidelines on how to avoid or differentiate true from false-positive predictions, refer to Box 1.)
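As a closing illustration of the scores discussed above, the sketch below estimates a PWM from a handful of hypothetical aligned sites and computes its information content and a simple log-likelihood ratio against a uniform background, showing the rule of thumb that the latter is roughly the information content times the number of sites. The uniform 0th-order background, the pseudocount and the toy sites are assumptions made for illustration; a MAP score would further add a log Pr(model) prior term, and real tools use higher-order Markov backgrounds and their own objective functions.

```python
import math
from collections import Counter

BASES = "ACGT"
BG_PROB = 0.25  # uniform background; tools often use 3rd- or higher-order Markov models

def pwm_from_sites(sites, pseudocount=0.5):
    """Estimate a PWM (per-position base-probability dicts) from aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for j in range(width):
        column = Counter(site[j] for site in sites)
        pwm.append({b: (column[b] + pseudocount) / total for b in BASES})
    return pwm

def information_content(pwm):
    """Relative entropy (bits) of the motif model versus the background."""
    return sum(p * math.log2(p / BG_PROB) for column in pwm for p in column.values())

def log_likelihood_ratio(pwm, sites):
    """Sum over sites of log2 P(site | motif) - log2 P(site | background)."""
    return sum(math.log2(pwm[j][base] / BG_PROB)
               for site in sites for j, base in enumerate(site))

if __name__ == "__main__":
    # Hypothetical aligned binding sites, for illustration only.
    sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA", "TGACTAA"]
    pwm = pwm_from_sites(sites)
    print("information content (bits):", round(information_content(pwm), 2))
    print("log-likelihood ratio      :", round(log_likelihood_ratio(pwm, sites), 2))
    print("IC x number of sites      :", round(information_content(pwm) * len(sites), 2))
    # A MAP score would further add a log Pr(model) prior term (not shown).
```

For well-conserved toy sites like these, the log-likelihood ratio tracks information content times the number of sites closely; the two diverge as the sites become noisier or the background model becomes richer.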