Protein Motif Recognition in DNA Sequences Containing Indel Errors ∗

Darrell Conklin and Terry Farrah
ZymoGenetics, Inc.
1201 Eastlake Ave. E., Seattle, WA, USA 98102
email: [email protected]
phone: (206) 442-6664
fax: (206) 442-6608

Abstract

Current biosequence analysis programs must be able to recognize protein-level homology or pattern in DNA sequences containing spurious insertions and deletions (indels) of nucleotides. We present a method for finding simple protein motifs in such sequences, building upon the standard dynamic programming sequence alignment algorithm. The commonly used frameshift penalty is replaced by a probabilistic model of the indel rate, providing both a means of controlling selectivity and a guide for evaluation. Furthermore, given a threshold on edit distance, the algorithm avoids computation of cells of the dynamic programming matrix that could not possibly be part of an acceptable alignment path. The method is practical for scanning large DNA sequence databases for protein motifs.

Keywords: protein motifs, DNA sequence analysis, sequencing errors, Expressed Sequence Tag, dynamic programming, statistical significance

∗ in GCB98: German Conference on Bioinformatics, Cologne, 1998.

1 Introduction

With the advances of the Human Genome Project, and other large-scale DNA sequencing projects, there has been much interest in the identification and functional annotation of DNA fragments. These data often arise from fast “single pass” sequencing of cDNA clone inserts (Boguski et al., 1993), and due to large-scale automation are prone to several types of error. These errors fall into two categories: miscalled bases and indels. The latter arise from false insertions or deletions (indels) of bases by automated base-callers, and can represent a real problem for sequence analysis methods. Indel errors cause the protein translation of a DNA sequence to switch reading frames.
With a 2% indel rate, for example, indels occur on average once every 50 bases, so the expected length of an unbroken reading frame is only 16 amino acids. Sequence alignment algorithms which compare a protein sequence to six individually translated reading frames may not identify statistically significant alignments in such short unbroken reading frames. This problem is referred to as the frameshift problem.

There have been several recent efforts to devise sequence alignment algorithms which deal with the frameshift problem (Guan and Uberbacher, 1996; Genetics Computer Group, Framesearch program, 1996; Pearson et al., 1997). These algorithms allow a path through the score matrix to wind through different reading frames, usually acquiring a penalty upon switching the frame. They are effective and sensitive with low indel rates and are now used routinely for the comparison of protein sequences with DNA sequence databases.

To recognize members of protein families, one can also use protein motifs which capture, in compact expressions, regularities common to whole protein families (e.g., the Prosite database; Bairoch et al., 1996). Motifs can be hand-constructed, created directly from protein family multiple alignments, or even discovered automatically in unaligned protein sequences (Brazma et al., 1995). Multiple degenerate motifs can be combined into networks of motifs (Meyers and Mehldau, 1993), and there exists a theory of statistical significance for motif scores (Staden, 1989).

In this paper we present an efficient method for finding simple motifs in DNA sequences containing indels. Our basic algorithm builds upon the method of Peltola et al. (1986), which is an extension of the standard dynamic programming algorithm for finding the minimum edit distance between a motif and a sequence.
We increase the efficiency of the algorithm by incorporating an extension of Ukkonen’s (1985) approximate string matching algorithm, which avoids computing portions of the dynamic programming score matrix which could not possibly be part of an acceptable alignment path. We then extend Staden’s (1989) probability equations for protein motifs to account for frameshifts, providing a measure of the statistical significance of any alignment between a protein motif and a DNA sequence containing indels. This measure can be used both to limit the results reported by an algorithm and to evaluate their biological significance. Using an implementation of this theory we demonstrate empirically, using Prosite motifs and dbEST (Boguski et al., 1993) sequences, that our method avoids the computation of a large proportion of score matrix cells.

2 Definitions

A motif component is a set of amino acids. A motif component c subsumes an amino acid a if a is in the set c. The component X denotes the set of all amino acids. Subsumption between components is denoted by the relation s; s(c, a) = 1 if c subsumes a, 0 otherwise.

A simple protein motif is a sequence of motif components. The notation c{n} denotes n repetitions of the component c. A motif subsumes a protein sequence if it can be aligned with the sequence such that subsumption holds between each aligned component/amino acid pair. In practice, we allow some mismatches to motif components. Given an alignment between motif and sequence, the edit distance of this relationship is simply the number of non-subsuming aligned components.

The problem considered by the algorithm in the next section is: given a protein motif and DNA sequence that may have indel errors, find those “acceptable” alignments having an edit distance no greater than a specified threshold k.

3 Algorithm

The problem of finding a motif in a DNA sequence while allowing frameshifts was phrased in terms of dynamic programming by Peltola et al. (1986).
Our formulation is quite similar. Let m be the length of a motif, and let n be the length of a DNA sequence. The subsumption problem can be solved by computing an m × (n − 2) dynamic programming score matrix E[i, j] where the second dimension of this matrix indexes an amino acid sequence comprising all three translation frames interleaved (for minus strand alignments, the matrix must also be computed for the reverse complement of the DNA sequence), e.g., for a DNA sequence abcdef the amino acid sequence (abc)(bcd)(cde)(def). The recurrence that is solved is as follows:

\[
\begin{aligned}
E[i,j] &= i - s(i,j), && 1 \le i \le m,\ 1 \le j \le 3 \\
E[1,j] &= 1 - s(1,j), && 1 \le j \le (n-2) \\
E[i,j] &= (1 - s(i,j)) + \min\left\{
  \begin{array}{l}
  E[i-1,j-3],\ E[i-1,j-2], \\
  E[i-1,j-4],\ E[i-1,j-1], \\
  E[i-1,j-5]
  \end{array}
\right\}, && \text{else}
\end{aligned}
\tag{1}
\]

where s(i, j) is the subsumption relationship between motif component i and amino acid j. The relation s can be computed in constant time if an m × 20 boolean subsumption matrix is precomputed. Entries in the bottom row m of the matrix, with values no greater than k, represent acceptable alignments with the motif. In practice a matrix cell will also contain a back-pointer (taking on values 1 through 5), and perhaps a running count of the number of frameshifts.

Note that, unlike Peltola et al. (1986), we do not penalize frameshifts. This is in order to make a clear distinction between the counting of mismatches, which may compensate for an overly specific motif, and the counting of indels, which are sequence errors. The next section will show how we instead rank alignments according to their probability. Nevertheless, to guide the search towards those alignments with fewer frameshifts, a priority is imposed on the different frames to break ties (see Figure 1). This ensures that at any given position paths with the fewest frameshifts are preferred.

             j − 5    j − 4    j − 3    j − 2    j − 1    j
    i − 1    5 (2)    3 (1)    1 (0)    2 (1)    4 (2)
    i                                                     √

Figure 1: In matrix E[i, j], the cell marked √ depends on five other cells.
The priorities of these cells are indicated. Beside the priorities, in brackets, are the number of frameshifts (indels in the underlying DNA sequence) counted by a back-pointer to the cell. That is, it is most desirable to count 0 frameshifts (cell j − 3), and least desirable to count either 2 insertions (cell j − 5) or 2 deletions (cell j − 1).

A basic implementation of recurrence (1) would simply compute all m × (n − 2) cells of the score matrix. It is possible to do much better than this on average. To efficiently solve recurrence (1), an idea similar to Ukkonen’s (1985) O(kn) approximate string matching algorithm can be employed. This method avoids computing cells of E[i, j] which could only contain a value greater than k and therefore could not be part of an acceptable alignment path. The following theorem indicates those cells which need not be computed (see Figure 2):

Theorem: For any column j in matrix E[i, j], if t is the largest index such that E[t, j] + s(t, j) ≤ k + 1, then E[t + p, j + 1] > k, for all p ≥ 2.

Proof: Assume the antecedent and consider E[t + p, j + 1] for any p ≥ 2. Suppose that s(t + p, j) = 0. The antecedent then states that E[t + p, j] > k + 1. This implies that E[t + p − 1, j − 1] > k, ..., E[t + p − 1, j − 5] > k. Now suppose that s(t + p, j) = 1. The antecedent then states that E[t + p, j] > k. Therefore, since s(t + p, j) = 1, E[t + p − 1, j − 1] > k, ..., E[t + p − 1, j − 5] > k. Also following from the antecedent is E[t + p − 1, j] > k. In both cases all cells on which E[t + p, j + 1] depends are greater than k, and since s(t + p, j + 1) ≥ 0, it follows that E[t + p, j + 1] > k. Q.E.D.

To implement this theorem, we compute the matrix E[i, j] column-wise.
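
Recurrence (1) and the column cutoff suggested by the theorem can be illustrated with a minimal Python sketch. It is not the implementation described in this paper. The function names (interleave, score_matrix, hits) and the toy motif are hypothetical, and the sketch rests on several assumptions: predecessor columns that fall before the start of the sequence are simply ignored, skipped cells hold a large sentinel standing in for "greater than k", only the plus strand is scanned, and back-pointers and frameshift counts are omitted. The codon table is the standard genetic code.

```python
# Standard genetic code, bases in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * x + 4 * y + z]
         for x, a in enumerate(BASES)
         for y, b in enumerate(BASES)
         for z, c in enumerate(BASES)}

INF = 10**9  # sentinel for skipped cells (assumption: stands for "> k")


def interleave(dna):
    """Translate all n - 2 overlapping codons: abcdef -> (abc)(bcd)(cde)(def)."""
    return [CODON[dna[j:j + 3]] for j in range(len(dna) - 2)]


def score_matrix(motif, dna, k=None):
    """Fill E column-wise (0-based indices internally).

    motif: list of sets of one-letter amino acids (the motif components).
    k: if given, rows that the theorem shows must exceed k are skipped.
    Returns (E, number of cells actually computed).
    """
    aa = interleave(dna)
    m, cols = len(motif), len(aa)
    E = [[INF] * cols for _ in range(m)]
    computed = 0
    limit = m  # how many rows to fill in the current column
    for j in range(cols):
        t = 0  # largest 1-based row with E + s <= k + 1 in this column
        for i in range(min(limit, m)):
            s = 1 if aa[j] in motif[i] else 0
            if j < 3:        # first three columns: E[i, j] = i - s(i, j)
                E[i][j] = (i + 1) - s
            elif i == 0:     # first row: E[1, j] = 1 - s(1, j)
                E[i][j] = 1 - s
            else:
                # Offsets in the priority order of Figure 1; columns before
                # the start of the sequence are ignored (assumption).
                best = min(E[i - 1][j - d] for d in (3, 2, 4, 1, 5) if j - d >= 0)
                E[i][j] = min((1 - s) + best, INF)
            computed += 1
            if k is not None and E[i][j] + s <= k + 1:
                t = i + 1
        # Theorem: rows t + 2 and beyond of the next column exceed k.
        limit = m if k is None else t + 1
    return E, computed


def hits(E, k):
    """0-based columns of the bottom row whose edit distance is at most k."""
    return [j for j, v in enumerate(E[-1]) if v <= k]


motif = [{"M"}, {"W"}, {"C"}]   # toy motif M-W-C (hypothetical)
clean = "ATGTGGTGT"             # encodes M W C in a single frame
noisy = "ATGATGGTGT"            # same, with one spurious A inserted after ATG
full, n_full = score_matrix(motif, noisy)
cut, n_cut = score_matrix(motif, noisy, k=0)
print(hits(full, 0), hits(cut, 0), n_full, n_cut)
```

On the toy input, the motif is still found with edit distance 0 despite the inserted base, because the path moves from column j − 4 rather than j − 3 at the frameshift. The run with k = 0 reports the same acceptable alignment while filling fewer cells than the full m × (n − 2) matrix.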
