Optimal Exact String Matching Based on Suffix Arrays

Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, and Stefan Kurtz

Faculty of Technology, University of Bielefeld
P.O. Box 10 01 31, 33501 Bielefeld, Germany
{mibrahim,enno,kurtz}@TechFak.Uni-Bielefeld.DE

Abstract. Using the suffix tree of a string S, decision queries of the type "Is P a substring of S?" can be answered in O(|P|) time and enumeration queries of the type "Where are all z occurrences of P in S?" can be answered in O(|P| + z) time, totally independent of the size of S. However, in large scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. The suffix array is a more space economical index structure. Using it and an additional table, Manber and Myers (1993) showed that decision queries and enumeration queries can be answered in O(|P| + log |S|) and O(|P| + log |S| + z) time, respectively, but no optimal time algorithms are known. In this paper, we show how to achieve the optimal O(|P|) and O(|P| + z) time bounds for the suffix array. Our approach is not confined to exact pattern matching. In fact, it can be used to efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. Experiments show that our method is not only of theoretical interest but also of practical relevance.

1 Introduction

The suffix tree of a sequence S can be computed and stored in O(n) time and space [13], where n = |S|. Once constructed, it allows one to answer queries of the type "Is P a substring of S?" in O(m) time, where m = |P|. Furthermore, all z occurrences of a pattern P can be found in O(m + z) time, totally independent of the size of S. Moreover, typical string processing problems like searching for all repeats in S can be efficiently solved by a bottom-up traversal of the suffix tree of S. These properties are most convenient in a "myriad" of situations [2], and Gusfield devotes about 70 pages of his book [8] to applications of suffix trees.
While suffix trees play a prominent role in algorithmics, they are not as widespread in actual implementations of software tools as one should expect. There are two major reasons for this: (i) Although asymptotically linear, the space consumption of a suffix tree is quite large; even the recently improved implementations (see, e.g., [10]) of linear time constructions still require 20n bytes in the worst case. (ii) In most applications, the suffix tree suffers from a poor locality of memory reference, which causes a significant loss of efficiency on cached processor architectures. On the other hand, the suffix array (introduced in [12] and in [6] under the name PAT array) is a more space efficient data structure than the suffix tree. It requires only 4n bytes in its basic form. However, at first glance, it seems that the suffix array has two disadvantages compared with the suffix tree: (1) The direct construction of the suffix array takes O(n · log n) time. (2) It is not clear that (and how) every algorithm using a suffix tree can be replaced with an algorithm based on a suffix array solving the same problem in the same time complexity. For example, using only the basic suffix array, it takes O(m · log n) time in the worst case to answer decision queries.

A.H.F. Laender and A.L. Oliveira (Eds.): SPIRE 2002, LNCS 2476, pp. 31–43, 2002. © Springer-Verlag Berlin Heidelberg 2002

Let us briefly comment on the two apparent drawbacks:

(1) The O(n · log n) time bound for the direct construction of the suffix array is not a real drawback, neither from a theoretical nor from a practical point of view. The suffix array of S can be constructed in O(n) time in the worst case by first constructing the suffix tree of S; see [8]. However, in practice the improved O(n · log n) time algorithm of [11] to directly construct the suffix array is reported to be more efficient than building it indirectly in O(n) time via the suffix tree.
(2) We strongly believe that every algorithm using a suffix tree can be replaced with an equivalent algorithm based on a suffix array and additional information. As an example, let us look at the exact pattern matching problem. Using an additional table, Manber and Myers [12] showed that decision queries can be answered in O(m + log n) time in the worst case. However, no O(m) time algorithm based on the suffix array was known for this task. In this paper, we will show how decision queries can be answered in optimal O(m) time and how to find all z occurrences of a pattern P in optimal O(m + z) time. This new result is achieved by using the basic suffix array enhanced with two additional tables; each can be computed in linear time and requires only 4n bytes. In practice each of these tables can even be stored in n bytes without loss of performance. Our new approach is not confined to exact pattern matching. In general, we can simulate any top-down traversal of the suffix tree by means of the enhanced suffix array. Thus, our method can efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. By taking the approach of Kasai et al. [9] one step further, it is also possible to efficiently solve all problems with enhanced suffix arrays that are usually solved by a bottom-up traversal of the suffix tree; see Abouelhoda et al. [1] for details.

Clearly, it would be desirable to further reduce the space requirement of the suffix array. Recently, interesting results in this direction have been obtained. The most notable ones are the compressed suffix array introduced by Grossi and Vitter [7] and the so-called opportunistic data structure devised by Ferragina and Manzini [4]. These data structures reduce the space consumption considerably. However, due to the compression, these approaches do not allow enumeration queries to be answered in O(m + z) time; instead they require O(m + z log^ε n) time, where ε > 0 is a constant.
Worse, experimental results [5] show that the gain in space reduction has to be paid for with considerably slower pattern matching; this is true even for decision queries. According to [5], the opportunistic index is 8-13 times more space efficient than the suffix array, but string matching based on the opportunistic index is 16-35 times slower than their implementation based on the suffix array. So there is a trade-off between time and space consumption. In contrast to that, suffix arrays can be queried at speeds comparable to suffix trees, while being much more space efficient than these. Moreover, experimental results show that our method can compete with the method of [12]. In the case of DNA sequences, it is even 1.5 times faster than the method of [12]. Therefore, it is not only of theoretical interest but also of practical relevance.

2 Basic Notions

In order to fix notation, we briefly recall some basic concepts. Let S be a string of length |S| = n over an ordered alphabet Σ. To simplify analysis, we suppose that the size of the alphabet is a constant, and that n < 2^32. The latter implies that an integer in the range [0, n] can be stored in 4 bytes. We assume that the special symbol $ is an element of Σ (which is larger than all other elements) but does not occur in S. S[i] denotes the character at position i in S, for 0 ≤ i < n. For i ≤ j, S[i..j] denotes the substring of S starting with the character at position i and ending with the character at position j.

The suffix array suftab is an array of integers in the range 0 to n, specifying the lexicographic ordering of the n + 1 suffixes of the string S$. That is, S_suftab[0], S_suftab[1], ..., S_suftab[n] is the sequence of suffixes of S$ in ascending lexicographic order, where S_i = S[i..n − 1]$ denotes the ith nonempty suffix of the string S$, 0 ≤ i ≤ n. The suffix array requires 4n bytes.
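The definition of suftab, together with the O(m · log n) decision query on the basic suffix array mentioned in the introduction, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `SENTINEL`, `build_suftab`, and `is_substring` are hypothetical, the construction simply sorts whole suffixes (so it meets neither the O(n · log n) nor the O(n) bound), and the largest Unicode code point is assumed as a stand-in for $ so that the built-in string comparison ranks it above every other character. The pattern is assumed not to contain that stand-in character.

```python
# Assumption: the largest Unicode code point plays the role of '$',
# so ordinary string comparison treats it as larger than any other symbol.
SENTINEL = chr(0x10FFFF)


def build_suftab(s):
    """Return (text, suftab) for text = s + SENTINEL.

    Naive construction that sorts the suffixes directly; for illustration
    only, not one of the constructions cited in the text.
    """
    text = s + SENTINEL
    suftab = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suftab


def is_substring(text, suftab, pattern):
    """Decision query by binary search over suftab.

    Each of the O(log n) probes compares up to m = len(pattern) characters,
    giving O(m log n) time in the worst case.
    """
    m = len(pattern)
    lo, hi = 0, len(suftab)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare the first m characters of the suffix with the pattern.
        if text[suftab[mid]:suftab[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix whose m-character prefix is >= pattern.
    return lo < len(suftab) and text[suftab[lo]:suftab[lo] + m] == pattern


if __name__ == "__main__":
    text, suftab = build_suftab("acaaacatat")
    print(suftab)
    print(is_substring(text, suftab, "caa"), is_substring(text, suftab, "tt"))
```

For the illustrative input acaaacatat, the sketch yields suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]; the query for caa succeeds and the query for tt fails.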
The direct construction of the suffix array takes O(n · log n) time [12], but it can be built in O(n) time via the construction of the suffix tree; see, e.g., [8].

The lcp-table lcptab is an array of integers in the range 0 to n. We define lcptab[0] = 0, and lcptab[i] is the length of the longest common prefix of S_suftab[i−1] and S_suftab[i], for 1 ≤ i ≤ n. Since S_suftab[n] = $, we always have lcptab[n] = 0; see Fig. 1. The lcp-table can be computed as a by-product during the construction of the suffix array, or alternatively, in linear time from the suffix array [9]. The lcp-table requires 4n bytes. However, in practice it can be implemented in little more than n bytes; see Section 8.

3 The lcp-Intervals of a Suffix Array

To achieve the goals outlined in the introduction, we need the following concepts.

Definition 1. Interval [i..j], 0 ≤ i < j ≤ n, is an lcp-interval of lcp-value ℓ if
1. lcptab[i] < ℓ,
2. lcptab[k] ≥ ℓ for all k with i + 1 ≤ k ≤ j,
3. lcptab[k] = ℓ for at least one k with i + 1 ≤ k ≤ j,
4.
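Returning to the lcp-table defined in Section 2, the following Python sketch shows one way lcptab could be computed from suftab in linear time. It follows the general idea of the method cited as [9] (reusing the previous match length when moving from suffix i to suffix i + 1), but it is an illustration under assumptions rather than the authors' code: the names `SENTINEL`, `build_suftab`, and `build_lcptab` are hypothetical, the suffix array is built by naive sorting, and the largest Unicode code point is assumed as a stand-in for $.

```python
# Assumption: the largest Unicode code point plays the role of '$'.
SENTINEL = chr(0x10FFFF)


def build_suftab(s):
    """Naive suffix array of s + SENTINEL (illustration only)."""
    text = s + SENTINEL
    return text, sorted(range(len(text)), key=lambda i: text[i:])


def build_lcptab(text, suftab):
    """Compute lcptab from suftab in time linear in len(text).

    lcptab[0] = 0, and lcptab[r] is the length of the longest common
    prefix of the suffixes at ranks r - 1 and r.
    """
    size = len(text)
    # rank[i] is the position of suffix i in suftab (the inverse array).
    rank = [0] * size
    for r, start in enumerate(suftab):
        rank[start] = r

    lcptab = [0] * size
    h = 0  # number of characters already known to match
    for i in range(size):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = suftab[r - 1]  # lexicographic predecessor of suffix i
        while i + h < size and j + h < size and text[i + h] == text[j + h]:
            h += 1
        lcptab[r] = h
        # Dropping the first character shortens the match by at most one.
        if h > 0:
            h -= 1
    return lcptab


if __name__ == "__main__":
    text, suftab = build_suftab("acaaacatat")
    print(build_lcptab(text, suftab))
```

For the illustrative input acaaacatat, the sketch yields lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0], with lcptab[n] = 0 as stated in the text.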
