Protein motif recognition in DNA sequences containing indel errors ∗

Darrell Conklin and Terry Farrah ZymoGenetics, Inc. 1201 Eastlake Ave. E., Seattle, WA, USA 98102 email: [email protected] phone: (206) 442-6664 fax: (206) 442-6608

Abstract Current biosequence analysis programs must be able to recognize protein-level homology or pattern in DNA sequences containing spurious insertions and deletions (indels) of bases. We present a method for finding simple protein motifs in such sequences, building upon the standard dynamic programming algorithm. The commonly used frameshift penalty is replaced by a probabilistic model of the indel rate, providing both a means of controlling selectivity and a guide for evaluation. Furthermore, given a threshold on edit distance, the algorithm avoids computation of cells of the dynamic programming matrix that could not possibly be part of an acceptable alignment path. The method is practical for scanning large DNA sequence databases for protein motifs.

Keywords: protein motifs, DNA sequence analysis, sequencing errors, Expressed Sequence Tag, dynamic programming, statistical significance

∗in GCB98: German Conference on Bioinformatics, Cologne, 1998.

1 Introduction

With the advances of the Human Genome Project, and other large-scale DNA sequencing projects, there has been much interest in the identification and functional annotation of DNA fragments. These data often arise from fast “single pass” sequencing of cDNA clone inserts (Boguski et al., 1993), and due to large-scale automation are prone to error. These errors fall into two categories: miscalled bases and indels. The latter arise from false insertions or deletions of bases by automated base-callers, and can represent a real problem for sequence analysis methods. Indel errors cause the protein translation of a DNA sequence to switch reading frames. With a 2% indel rate, for example, the expected length of an unbroken reading frame is only 16 amino acids. Sequence alignment algorithms which compare a protein sequence to six individually translated reading frames may not identify statistically significant alignments in such short unbroken reading frames. This problem is referred to as the frameshift problem.

There have been several recent efforts to devise sequence alignment algorithms which deal with the frameshift problem (Guan and Uberbacher, 1996; Genetics Computer Group, Framesearch program, 1996; Pearson et al., 1997). These algorithms allow a path through the score matrix to wind through different reading frames, usually acquiring a penalty upon switching the frame. They are effective and sensitive with low indel rates and are now used routinely for the comparison of protein sequences with DNA sequence databases.

To recognize members of protein families, one can also use protein motifs which capture, in compact expressions, regularities common to whole protein families (e.g., the Prosite database; Bairoch et al., 1996). Motifs can be hand-constructed, created directly from protein family multiple alignments, or even discovered automatically in unaligned protein sequences (Brazma et al., 1995). Multiple degenerate motifs can be combined into networks of motifs (Meyers and Mehldau, 1993), and there exists a theory of statistical significance for motif scores (Staden, 1989).

In this paper we present an efficient method for finding simple motifs in DNA sequences containing indels. Our basic algorithm builds upon the method of Peltola et al. (1986), which is an extension of the standard dynamic programming algorithm for finding the minimum edit distance between a motif and a sequence. We increase the efficiency of the algorithm by incorporating an extension of Ukkonen’s (1985) approximate string matching algorithm, which avoids computing portions of the dynamic programming score matrix which could not possibly be part of an acceptable alignment path. We then extend Staden’s (1989) probability equations for protein motifs to account for frameshifts, providing a measure of the statistical significance of any alignment between a protein motif and a DNA sequence containing indels. This measure can be used both to limit the results reported by an algorithm and to evaluate their biological significance. Using an implementation of this theory we demonstrate empirically, using Prosite motifs and dbEST (Boguski et al., 1993) sequences, that our method avoids the computation of a large proportion of score matrix cells.

2 Definitions

A motif component is a set of amino acids. A motif component c subsumes an amino acid a if a is in the set c. The component X denotes the set of all amino acids. Subsumption between components is denoted by the relation s; s(c, a) = 1 if c subsumes a, 0 otherwise.

A simple protein motif is a sequence of motif components. The notation c{n} denotes n repetitions of the component c. A motif subsumes a protein sequence if it can be aligned with the sequence such that subsumption holds between each aligned component/amino acid pair. In practice, we allow some mismatches to motif components. Given an alignment between motif and sequence, the edit distance of this relationship is simply the number of non-subsuming aligned components.

The problem considered by the algorithm in the next section is: given a protein motif and DNA sequence that may have indel errors, find those “acceptable” alignments having an edit distance no greater than a specified threshold k.
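The definitions above can be illustrated with a minimal Python sketch. The function names (parse_motif, subsumes, edit_distance) and the textual motif syntax it accepts (single letters, bracketed sets such as [KR], and repeat counts such as X{3}) are assumptions made for illustration, modelled on the way motifs are written in this paper; this is not the authors' implementation.

```python
import re

# The 20 standard amino acids; the component X stands for this whole set.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")

def parse_motif(text):
    """Turn a motif string such as 'CX{3}[DN]' into a list of components (sets)."""
    components = []
    pattern = r"(?:\[([A-Z]+)\]|([A-Z]))(?:\{(\d+)\})?"
    for bracketed, single, repeat in re.findall(pattern, text):
        letters = bracketed or single
        component = AMINO_ACIDS if letters == "X" else frozenset(letters)
        components.extend([component] * (int(repeat) if repeat else 1))
    return components

def subsumes(component, amino_acid):
    """The relation s(c, a): 1 if component c subsumes amino acid a, else 0."""
    return 1 if amino_acid in component else 0

def edit_distance(motif, segment):
    """Number of non-subsuming pairs in an ungapped motif/segment alignment."""
    return sum(1 - subsumes(c, a) for c, a in zip(motif, segment))
```

For example, edit_distance(parse_motif("[FI]S[KR]"), "FSA") counts one mismatch, because A is not in the component [KR].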

3 Algorithm

The problem of finding a motif in a DNA sequence while allowing frameshifts was phrased in terms of dynamic programming by Peltola et al. (1986). Our formulation is quite similar. Let m be the length of a motif, and let n be the length of a DNA sequence. The subsumption problem can be solved by computing an m × (n − 2) dynamic programming score matrix E[i, j], where the second dimension of this matrix indexes an amino acid sequence comprising all three translation frames interleaved; e.g., for a DNA sequence abcdef, the amino acid sequence (abc)(bcd)(cde)(def). For minus strand alignments, the matrix must also be computed for the reverse complement of the DNA sequence. The recurrence that is solved is as follows:
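The interleaved translation that indexes the second dimension of the matrix can be sketched as follows. This is a minimal sketch assuming the standard genetic code; the names interleaved_translation and reverse_complement are hypothetical, and the placeholder "?" for codons containing ambiguous bases is an illustrative choice, not something specified in the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def interleaved_translation(dna):
    """Translate every overlapping codon, giving n - 2 symbols for n bases.

    Position j (0-based) holds the amino acid for dna[j:j+3], so the three
    reading frames are interleaved; '*' marks a stop codon and '?' marks a
    codon with an unrecognized base (an assumption of this sketch).
    """
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[j:j + 3], "?") for j in range(len(dna) - 2))

def reverse_complement(dna):
    """Reverse complement, needed for minus strand alignments."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

For the nine-base sequence ATGAAACTG this yields the seven symbols M*EKNTL, of which positions 1, 4 and 7 (1-based) form the first reading frame.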

$$
E[i, j] =
\begin{cases}
i - s(i, j), & 1 \le i \le m,\ 1 \le j \le 3, \\
1 - s(1, j), & i = 1,\ 1 \le j \le n - 2, \\
(1 - s(i, j)) + \min \left\{
E[i-1, j-3],\ E[i-1, j-2],\ E[i-1, j-4],\ E[i-1, j-1],\ E[i-1, j-5]
\right\}, & \text{otherwise,}
\end{cases}
\tag{1}
$$

where s(i, j) is the subsumption relationship between motif component i and amino acid j. The relation s can be computed in constant time if an m × 20 boolean subsumption matrix is precomputed. Entries in the bottom row m of the matrix, with values no greater than k, represent acceptable alignments with the motif. In practice a matrix cell will also contain a back-pointer (taking on values 1 through 5), and perhaps a running count of the number of frameshifts.
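A basic, unoptimized evaluation of recurrence (1) might look like the sketch below. It assumes the motif is given as a list of sets of amino acids and the sequence as an already interleaved three-frame translation; the function name score_matrix is hypothetical, and skipping predecessors whose column index would fall below 1 is an assumption of this sketch, since the recurrence does not spell out that boundary case.

```python
def score_matrix(motif, aa):
    """Fill E[i][j] (1-based; row and column 0 are unused padding).

    motif: list of sets of amino acids (length m).
    aa: interleaved three-frame translation (length n - 2).
    """
    m, cols = len(motif), len(aa)

    def s(i, j):
        # Subsumption between motif component i and amino acid j (1-based).
        return 1 if aa[j - 1] in motif[i - 1] else 0

    E = [[0] * (cols + 1) for _ in range(m + 1)]
    for j in range(1, cols + 1):
        for i in range(1, m + 1):
            if j <= 3:
                E[i][j] = i - s(i, j)          # boundary columns
            elif i == 1:
                E[i][j] = 1 - s(1, j)          # first motif component
            else:
                # Offsets 3, 2, 4, 1, 5 correspond to 0, 1, 1, 2, 2 frameshifts.
                best = min(E[i - 1][j - d] for d in (3, 2, 4, 1, 5) if j - d >= 1)
                E[i][j] = (1 - s(i, j)) + best
    return E
```

With the motif M, K, L and the translation M*EKKNTL (the sequence ATGAAACTG with one extra A inserted), the bottom-row cell in column 8 is 0: the path reaches it through one frameshift but no mismatches.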

Note that, unlike Peltola et al. (1986), we do not penalize frameshifts. This is in order to make a clear distinction between the counting of mismatches, which may compensate for an overly specific motif, and the counting of indels, which are sequence errors. The next section will show how we instead rank alignments according to their probability. Nevertheless, to guide the search towards those alignments with fewer frameshifts, a priority is imposed on the different frames to break ties (see Figure 1). This ensures that at any given position paths with the fewest frameshifts are preferred.

         j − 5   j − 4   j − 3   j − 2   j − 1   j
i − 1    5 (2)   3 (1)   1 (0)   2 (1)   4 (2)
i                                                √

Figure 1: In matrix E[i, j], the cell marked √ depends on five other cells. The priorities of these cells are indicated. Beside the priorities, in brackets, are the number of frameshifts (indels in the underlying DNA sequence) counted by a back-pointer to the cell. That is, it is most desirable to count 0 frameshifts (cell j − 3), and least desirable to count either 2 insertions (cell j − 5) or 2 deletions (cell j − 1).
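The tie-breaking rule of Figure 1 can be written as a small helper. This is a sketch only: the name pick_predecessor and the dictionary-based interface are hypothetical, and equating the back-pointer value with the priority number is an assumption based on the statement that back-pointers take values 1 through 5.

```python
# Column offset d (predecessor is E[i-1][j-d]) -> (priority, frameshifts),
# exactly as listed in Figure 1.
PRIORITY = {3: (1, 0), 2: (2, 1), 4: (3, 1), 1: (4, 2), 5: (5, 2)}

def pick_predecessor(scores):
    """Choose the predecessor with the lowest score, breaking ties by priority.

    scores: dict mapping an available offset d (1..5) to the value E[i-1][j-d].
    Returns (offset, back_pointer, frameshifts).
    """
    d = min(scores, key=lambda off: (scores[off], PRIORITY[off][0]))
    back_pointer, frameshifts = PRIORITY[d]
    return d, back_pointer, frameshifts
```

When all five predecessors hold the same score the in-frame cell (offset 3) wins, so a frameshift is only introduced when it strictly reduces the number of mismatches.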

A basic implementation of recurrence (1) would simply compute all m × (n − 2) cells of the score matrix. It is possible to do much better than this on average. To efficiently solve recurrence (1), an idea similar to Ukkonen’s (1985) O(kn) approximate string matching algorithm can be employed. This method avoids computing cells of E[i, j] which could only contain a value greater than k and therefore could not be part of an acceptable alignment path. The following theorem indicates those cells which need not be computed (see Figure 2):

Theorem: For any column j in matrix E[i, j], if t is the largest index such that E[t, j] + s(t, j) ≤ k + 1, then E[t + p, j + 1] > k, for all p ≥ 2.

Proof: Assume the antecedent and consider E[t + p, j + 1] for any p ≥ 2. Suppose that s(t + p, j) = 0. The antecedent then states that E[t + p, j] > k + 1. This implies that E[t + p − 1, j − 1] > k, . . . , E[t + p − 1, j − 5] > k. Now suppose that s(t + p, j) = 1. The antecedent then states that E[t + p, j] > k. Therefore, since s(t + p, j) = 1, E[t + p − 1, j − 1] > k, . . . , E[t + p − 1, j − 5] > k. Also following from the antecedent is E[t + p − 1, j] > k. In both cases all cells on which E[t + p, j + 1] depends are greater than k, and since s(t + p, j + 1) ≥ 0, it follows that E[t + p, j + 1] > k. Q.E.D.

To implement this theorem, we compute the matrix E[i, j] column-wise. For each column j, we retain the index t in the theorem, and set cells E[t + 2, j] through E[t + 5, j] to the value k + 2. In the subsequent column it suffices to compute only cells E[1, j + 1] through E[t + 1, j + 1], and to update the index t for that column.
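The column-wise cutoff can be sketched in Python as below. This is a simplified illustration rather than the authors' implementation: instead of writing k + 2 into a few cells per column, the sketch initializes the whole matrix to the sentinel k + 2, and it always computes the three boundary columns in full. The name scan_bottom_row and the use_cutoff switch (which makes it easy to compare against the basic algorithm) are hypothetical.

```python
def scan_bottom_row(motif, aa, k, use_cutoff=True):
    """Return (hits, cells_computed) for an interleaved translation aa.

    hits is a list of (column, edit distance) pairs for bottom-row cells with
    edit distance <= k. Cells skipped by the cutoff keep the sentinel k + 2.
    """
    m, cols = len(motif), len(aa)
    E = [[k + 2] * (cols + 1) for _ in range(m + 1)]   # 1-based indexing
    computed = 0
    t = m
    for j in range(1, cols + 1):
        # Rows beyond t + 1 cannot hold a value <= k (theorem above).
        last = m if (j <= 3 or not use_cutoff) else min(m, t + 1)
        new_t = 0
        for i in range(1, last + 1):
            s = 1 if aa[j - 1] in motif[i - 1] else 0
            if j <= 3:
                E[i][j] = i - s
            elif i == 1:
                E[i][j] = 1 - s
            else:
                best = min(E[i - 1][j - d] for d in range(1, 6) if j - d >= 1)
                E[i][j] = (1 - s) + best
            computed += 1
            if E[i][j] + s <= k + 1:
                new_t = i          # largest row satisfying the theorem's test
        t = new_t
    hits = [(j, E[m][j]) for j in range(1, cols + 1) if E[m][j] <= k]
    return hits, computed
```

Calling the function twice, with use_cutoff=True and use_cutoff=False, should report the same hits while the first call touches no more cells than the second; the ratio of the two cell counts corresponds to the kind of "% cells computed" figure discussed in the results.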

         j       j + 1
t
t + 1    ↑
t + 2    ↑       ×
  .
  .
t + p    ↑       ×
  .
  .
m        ↑       ×

Figure 2: In matrix E[i, j], the cells marked ↑ are greater than k + 1; the theorem of this paper states that those marked × need not be computed.

4 Significance of motif alignments

When scanning a database for pattern or homology it is important to know whether a result is statistically significant (Karlin and Altschul, 1990). Given a protein motif M (with m components), and an edit distance threshold k, we wish to know the probability of finding at least one acceptable alignment in random data. This probability allows us to identify insignificant results (alignments too likely to appear by chance alone) and also allows us to identify interesting motifs, often signified by a large deviation between the expected number of acceptable alignments and the observed number (Roudier et al., 1996).

The probability P(M, k) of the motif M having an acceptable alignment (at most k mismatches) with a random segment of m amino acids (a DNA segment of length 3m) can be computed using probability generating functions as described by Staden (1989). This probability P(M, k) must be transformed, however, to accommodate the extra freedom allowed by the switching of frame within an alignment. The transformed probability P′(M, k, i), derived here, will be the probability of finding an acceptable alignment (at most k mismatches) with the segment using exactly i frameshifts.

A single indel at a particular position can either be a deletion of that base, or the insertion of any of the four nucleotides after that base. Thus there are (5 × 3m)^i different ways to place i indels in the segment (this is an upper bound because a deletion and an insertion at the same position cancel each other out). Given these (15m)^i DNA segments of length 3m, (15m)^i P(M, k) is the upper bound on the expected number of acceptable alignments with the motif and, according to the Poisson distribution,

$$
P'(M, k, i) = 1 - \exp\left\{ -(15m)^{i} \, P(M, k) \right\}
$$

is the probability of finding at least one acceptable alignment in these segments. The expected number of acceptable matches, with the same number of frameshifts, to a database with n nucleotides is

$$
2n \, P'(M, k, i)
$$

(the factor of 2 is to account for the reverse complement), and the probability (pval) of finding at least one acceptable alignment in the database is

$$
1 - \exp\left\{ -2n \, P'(M, k, i) \right\}.
$$
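The chain of formulas above can be traced numerically with a short sketch. Several things here are assumptions made for illustration: the function names, the representation of a motif as a list of per-component match probabilities (which in turn assumes independent positions and known background amino acid frequencies), and the use of a simple polynomial product for P(M, k), which follows the generating-function idea cited from Staden (1989) without claiming to reproduce that paper's exact procedure.

```python
import math

def prob_at_most_k_mismatches(match_probs, k):
    """P(M, k): probability that a random segment has at most k mismatches.

    match_probs[c] is the chance that a random amino acid is subsumed by
    component c. The list dist holds the coefficients of the generating
    function; dist[q] is the probability of exactly q mismatches so far.
    """
    dist = [1.0]
    for p in match_probs:
        nxt = [0.0] * (len(dist) + 1)
        for q, weight in enumerate(dist):
            nxt[q] += weight * p              # component matches
            nxt[q + 1] += weight * (1.0 - p)  # component mismatches
        dist = nxt
    return sum(dist[: k + 1])

def p_prime(p_mk, m, i):
    """P'(M, k, i) = 1 - exp(-(15 m)^i P(M, k))."""
    return -math.expm1(-((15 * m) ** i) * p_mk)

def expected_matches(p_mk, m, i, n):
    """Expected acceptable matches in a database of n nucleotides."""
    return 2 * n * p_prime(p_mk, m, i)

def pval(p_mk, m, i, n):
    """Probability of at least one acceptable alignment in the database."""
    return -math.expm1(-expected_matches(p_mk, m, i, n))
```

The calls to math.expm1 evaluate 1 − exp(−x) without losing precision when x is very small, which is the usual situation for specific motifs.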

In our implementation, a table of pvals for all k, i pairs is precomputed for a given motif, and during the scan of the bottom row of the matrix all acceptable alignments falling within a pval limit e (which has a default value of 0.01) are reported. Either k or e can be specified by a scientist to limit the reporting to biologically or statistically significant results. If k is specified by the scientist, we set e to a maximum of 1.0 (i.e., all acceptable alignments are reported without regard to their pval). Otherwise we set k to the largest value whose pval with 0 frameshifts is within e, that is, the value of k such that the pval of k + 1 mismatches and 0 frameshifts is greater than e (recall that k is used to avoid computing certain matrix cells).
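One reading of this parameter handling is sketched below. The names choose_k and reportable, the list and dictionary layouts of the precomputed pvals, and the interpretation of the rule for k (the largest k whose 0-frameshift pval stays within e) are all assumptions of this sketch; the numbers in the example call are illustrative only.

```python
def choose_k(pval_zero_shift, e=0.01):
    """Pick the edit distance threshold k from the pval limit e.

    pval_zero_shift[q] is the pval for q mismatches and 0 frameshifts,
    assumed nondecreasing in q. The loop stops at the first k for which
    k + 1 mismatches would already exceed e.
    """
    k = 0
    while k + 1 < len(pval_zero_shift) and pval_zero_shift[k + 1] <= e:
        k += 1
    return k

def reportable(pval_table, k, e):
    """List the (mismatches, frameshifts) pairs that would be reported.

    pval_table maps (mismatches, frameshifts) to a precomputed pval.
    """
    return sorted(
        pair for pair, p in pval_table.items() if pair[0] <= k and p <= e
    )

# Illustrative use: with these made-up pvals, two mismatches are already too
# likely by chance at e = 0.01, so k is set to 1.
k = choose_k([1e-8, 1e-5, 0.02, 0.9], e=0.01)
```

Because k also controls which matrix cells are computed, a stricter e not only shortens the report but also reduces the work done by the scan.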

5 Results and discussion

The algorithm presented above finds protein motifs in DNA sequences containing indel errors. It is a dynamic programming algorithm which avoids computing cells of a score matrix which could not possibly be part of an alignment path. The number of cells avoided is related to the specified edit distance threshold k and also to the degeneracy of a particular motif. In this section we present two experiments to judge the effectiveness of the algorithm: a sensitivity analysis and a timing analysis.

To explore the sensitivity of the algorithm, we have done a simulation experiment. The mRNA for human thrombopoietin (GenBank accession L33410, length 1795 bases) was mutated with up to a 5% indel rate. To simulate an insertion at a position, the previous base was repeated. The mutated sequence was then probed using the Prosite (Bairoch et al., 1996) EPO_TPO motif (28 components). Figure 3 presents the averaged results of 5000 simulated frameshift experiments. We explored the behavior of the algorithm with the motif at edit distance thresholds k of 0 and 1. When an edit distance threshold of k = 1 is used, the algorithm has near 100% sensitivity. At an edit distance threshold of k = 0, the algorithm is over 90% sensitive at indel rates below 1%. At a 5% indel rate, the algorithm is 64% sensitive with k = 0.

The decrease in sensitivity with k = 0 is due to two limitations of our method. First, the representation of a DNA sequence by an interleaving of three translation frames is incomplete. In cases where an indel alters a translated amino acid, the indel may result in a mismatch. This problem can be addressed by a slight reformulation of recurrence (1) to consider, during score matrix construction, all possible translation products arising from an indel (Peltola et al., 1986, describe such a scheme). The second limitation is due to the post-application of a pval ranking and frameshift limit to an algorithm that minimizes mismatches. It is possible that for a given cell on the bottom row of the matrix, our algorithm will report a certain acceptable alignment, when there exists an alternative alignment with more mismatches, fewer frameshifts, and a lower pval.

However, a different but equally troublesome lack of sensitivity occurs when a frameshift penalty is incorporated into the computed edit distance. One may set an edit distance threshold k but one may really be interested only in alignments with mismatch count no higher than some q < k. For a given sequence such an algorithm may report an alignment with mismatch count greater than q when there exists an alternative alignment with greater (but still acceptable) edit distance but a mismatch count no greater than q. We prefer to maintain the distinction between the counting of mismatches and frameshifts.

[Figure 3 plot: sensitivity (vertical axis, 0 to 1) against indel rate (horizontal axis, 0 to 0.05), with one curve each for k = 0 and k = 1.]

Figure 3: Performance of the motif algorithm on simulated data (see text).

To demonstrate the performance of this algorithm on practical examples, we have applied all (simple) Prosite motifs (Bairoch et al., 1996) against all sequences in dbEST (Boguski et al., 1993). Many motifs were recognized with frameshifts, indicating that the algorithm has real practical value for identifying motifs in sequences containing indel errors (Figure 4 gives an example). Table 1 shows, for some selected motifs, the fraction of the possible matrix cells actually computed, in addition to the real time taken on 16 R10000 processors and the real time taken by a basic implementation which computes all matrix cells. The algorithm behaves as expected: with increasing k the expected value of t in the theorem is higher, and more cells are computed. On average, over all simple motifs in Prosite and sequences in dbEST (using k = 0), the method computes 33% of all matrix cells, indicating a threefold increase in real time performance of the algorithm over a basic implementation (see Table 1).

This paper has only considered the problem of finding simple protein motifs.
A more powerful class of pattern is captured by network expressions (Meyers and Mehldau, 1993). These expressions join simple motifs together with distance constraints between them. For future research it would be interesting to enhance our motif algorithm with the ability to search for network expressions. Statistical significance of network expressions in sequence data containing indels can be evaluated using the scheme presented here.
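Returning to the simulation experiment described in this section, the mutation step can be sketched as follows. The paper states only that the sequence was mutated at a given indel rate and that an insertion repeats the previous base; the even split between insertions and deletions, the per-position independence, and the function name mutate_with_indels are assumptions of this sketch.

```python
import random

def mutate_with_indels(dna, rate, rng):
    """Return a copy of dna in which each base suffers an indel with probability rate.

    A deletion drops the base; an insertion repeats it, so the inserted base
    equals the previous base. Insertions and deletions are assumed equally
    likely (an assumption of this sketch, not stated in the paper).
    """
    out = []
    for base in dna:
        if rng.random() < rate:
            if rng.random() < 0.5:
                continue              # deletion of this base
            out.append(base)
            out.append(base)          # insertion: repeat the previous base
        else:
            out.append(base)
    return "".join(out)

# Example: one mutated copy of a short, made-up sequence at a 5% indel rate.
mutated = mutate_with_indels("ATGAAACTGGCTAGCTAGGATC", 0.05, random.Random(0))
```

Repeating this with many random seeds and counting how often the motif is still found at a given threshold k would give a sensitivity estimate of the kind plotted in Figure 3.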

Motif: GXXXXXXGX[FYW]XG[LIVM]X[LIVM]XXXXGK[NH]XG[STA]XXGXXYF
Database: dbest
Motif type: p
Sequence type: n
Mismatches tolerated: at most 1
Number of hits requested: 100
Maximum pval: 0.7

Letters in database: 512479339

Locus       Matching Segment                      Pos  Fr  Sc  FS  Pval   Closest Protein

EST1085745  GMTSFAVGKWVGVVLDEPKGKNSGSIKGQQYF      311   2  32   0  5e-08  fruitfly dynactin (100%)
EST258203   GTTNFAPGYWYGIELEKPHGKNDGSVGGVQYF       95   2  32   0  5e-08  chicken restin (54%)
EST415473   GPIHGKDGMFCGIELLEPNGKHDGTFQGVSYF      125   2  32   0  5e-08  yeast nuclear fusion protein bik1 (50%)
EST572580   GKTDSAPGYWYGIELDHPTGKHDGSVFGVRYF       55   4  32   0  5e-08  chicken restin (58%)
EST948844   GETDFAKGEWCGVELDEPLGKNDGAVAGTRYF      240   3  32   0  5e-08  human restin (100%)
EST954880   GETDFAKGEWCGVELDEPLGKNDGAVAGTRYF      117   3  32   0  5e-08  human restin (100%)
EST959063   GKTDFAPGYWYGIELDQPTGKHDGSVFGVRYF      141   3  32   0  5e-08  human KIAA0291 (44%)
EST841405   GMTSFAVGKWVGVVLDEPKCKNSGSIKGQQYF      341   2  31   0  9e-06  fruitfly dynactin (89%)
EST220568   GLTDFKPGYWIGV/AMMSHCG/KNDGSVNGKRYF    105   6  32   2  0.005  human tubulin folding cofactor B (100%)
EST661274   GLTDFKPGYWIGV/AMMSHWGKNDG/SVNGKRYF    199   1  32   2  0.005  human tubulin folding cofactor B (100%)
EST178357   GQQTSLQGYWYGI/SLKNPMGKNDGS/LGG/LQYF   162   3  32   3  0.5    chicken restin (39%)
EST343442   GPHRFQTG/YWIGV//PMMSPLGKNDGSVNGKRYF    47   5  32   3  0.5    human tubulin folding cofactor B (100%)
EST904860   GMTSFAVGKWVGVVL/GRAEG//KNSGSIKGQQYF   302   2  32   3  0.5    fruitfly dynactin (100%)

Figure 4: Performance of a protein motif search on real data. The motif is the Prosite cytoskeleton-associated proteins glycine-rich domain (CAP_GLY). Each acceptable alignment is presented with EST name, matched segment, starting position of segment, starting frame, score, number of frameshifts, pval, and closest protein match according to blastx. Positions in an alignment where frameshifts were assumed are indicated with /.

motif                               k   % cells computed   real time   real time (basic)

I[KR]PX[FY]VFDGXXPXLK               0   17                 0:32        2:29
                                    1   32                 0:59        2:28
[GST][LIVMP]VYAVEF                  0   48                 0:48        1:24
                                    1   68                 1:09        1:37
FPXR[IM]XDWLX[NQ]                   0   23                 0:32        1:45
                                    1   47                 1:04        2:01
CXCX{3}CX{5}CCX[DN][FY]X{3}C        0   20                 0:54        3:38
                                    1   45                 1:57        3:48
CVSEXISF[LIVM]T[SG]EAS[DE][KRQ]C    0   15                 0:31        2:42
                                    1   27                 0:56        3:24
RKRKYFKKHEKR                        0   21                 0:31        1:57
                                    1   33                 0:50        1:59
                                    2   45                 1:07        2:10
[FI]S[KR]KCS[EK]RWKTM               0   24                 0:35        1:57
                                    1   39                 0:58        1:59
EXLCCX[KR]CX{4}[DE]XNX{4}CXCRVP     0   12                 0:39        3:55
                                    1   24                 1:12        4:07
                                    2   41                 2:06        4:23

Table 1: Performance of the motif algorithm on dbEST. k: edit distance threshold. Real time: minutes on 16 R10000 processors; data in memory cache; threshold-sensitive algorithm and basic algorithm.

References

[1] Bairoch, A., Bucher, P. and Hoffmann, K. (1996) The PROSITE database, its status in 1995. Nucleic Acids Research, 24, 189–196.

[2] Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993) dbEST–database for “Expressed Sequence Tags”. Nature Genetics, 4, 332–333.

[3] Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, D. (1995) Approaches to the Automatic Discovery of Patterns in Biosequences. Technical report, Department of Informatics, University of Bergen, Norway.

[4] Genetics Computer Group, Inc. (1996) Release 8.1.

[5] Guan, X. and Uberbacher, E. (1996) Alignments of DNA and protein sequences containing frameshift errors. Comput. Applic. Biosci., 12, 31–40.

[6] Karlin, S. and Altschul, S. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87, 2264–2268.

[7] Meyers, E. and Mehldau, G. (1993) A system for pattern matching applications on biosequences. Comput. Applic. Biosci., 9, 299–314.

[8] Pearson, W. R., Wood, T., Zhang, Z., and Miller, W. (1997) Comparison of DNA sequences with protein sequences. Genomics, 46, 24–36.

[9] Peltola, H., Soderlund, H., and Ukkonen, E. (1986) Algorithms for the search of amino acid patterns in nucleic acid sequences. Nucleic Acids Research, 14(1), 99–107.

[10] Roudier, C., Auger, I., and Roudier, J. (1996) Molecular mimicry reflected through database screening: serendipity or survival strategy? Trends Immun., 17, 357–358.

[11] Staden, R. (1989) Methods for evaluating the probabilities of finding patterns in sequences. Comput. Applic. Biosci., 5, 89–96.

[12] Ukkonen, E. (1985) Finding approximate patterns in strings. J. Algorithms, 6, 132–137.
