Suffix Arrays: a New Method for On-Line String Searches

Chapter 35 Suffix Arrays: A New Method for On-Line String Searches Udi Manber* Gene Myers# Abstract In these cases, it is worthwhile to construct a data structure to allow fast queries. Suffur trees are data structures A new and conceptually simple data structure, called a that admit efficient on-line siring searches. A suffix tree sufsuc array, for on-line string searches is introduced in for a text A of length N over an alphabet C can be built in this paper. Constructing and querying suffix arrays is 0 (N log 1C I) time and 0 (N) space cWei73, McC761. reduced to a sort and search paradigm that employs novel Suffix trees permit on-line string searches of the type, “Is aIgorithms. The main advantage of suffix arrays over W a substring of A?” to be answered in O(Plog 1x1) suffix trees is that they are three to five times more space time, where P is the length of W. We explicitly consider efficient. Suffix arrays permit on-line string searches of the dependence of the complexity of the algorithms on the type, “Is W a substring of A?” to be answered in I Z I, rather than assume that it is a fixed constant, because time 0 (Z’ +-IogN), where P is the length of W and N is C can be quite large for many applications. Suffix trees the length of A, which is competitive with (and in some can also be constructed in time 0 (N) with 0 (P) time for cases slightly better than) suffix trees. The only drawback a query, but this requires 0 (N I I: I ) space, which renders is that in those instances where the underlying alphabet is this method impractical in many applications. finite and small, suffix trees can be constructed in 0 (N) Suffix trees have been studied and used extensively. time in the worst-case versus 0 (NlogN) time for suffix A survey paper by Apostolic0 [Apo85] cites over forty arrays. We show, however, that suffix arrays can be con- references. Suffix trt~s have been refined from tries to structed in 0 (N) expected time, regardless of the alphabet minimum state finite automaton [BBE85], generalized to size. We believe that suffix arrays will prove to be better on-line construction [MR80, BB8q, and real-time con- in practice than suffix trees for many applications. struction [SliSO], and parallelized [AI86]. Suffix trees have been applied to fimdamental string problems such as finding the longest repeated substring lJVei731, finding all 1. Introduction squares or repetitions in a string [AP83], computing sub- Finding all instances of a string W in a large text A is an string statistics [AP85], approximate string matching important pattern matching probIem. There are many LV86, Mye881, and string comparison CEH86]. They appIications in which a fixed text is queried many times. have also been used to address other types of problems * Supported in part by an NSF Presidential Young Investigator Award (grant DCR-8451397), with matching funds from AT&T. # Supported in part by the NIH (grant ROl LMO496O-01). 319 such as text compression @PE8 11, compressing assembly approximate matching algorithm [ME89]. code LFWM84], inverted indices [Car75], and analyzing The paper is organized as follows. In Section 2, we genetic sequences [CHM86]. Galil [Ga85] lists a number present the search algorithm under the assumption that the of open problems concerning suffix trees and on-line suffix array and the lcp information have been computed. string searching. In Section 3, we show how to construct the sorted suffix In this paper, we present a new data structure, array. In Section 4, we give the algorithm for computing called a suffucarray, that is basically a sorted list of all the the Icp information. In Section 5, we modify the algo- suffixes of A. When coupled with information about the rithms to achieve better expected running times. We end longest common prefixes (lcps) of adjacent elements in with empirical results and comments about practice in the suffix array, string searches can be answered in Section 6. 0 (P + IogN) time with a simple augmentation to a classic binary search. The suffix array and associated Zcp infor- 2. Searching mation occupy a mere 2N integers, and searches are Let A = aoal - * * aNml be a large text of length N. shown to require at most P + [log, (N-l)] single-symbol Denote by Ai = aiai+l * . ’ UN-1 the suffix of A that starts comparisons. The construction of the suffix array and Zcp at position i. The basis of our data structure is a lexico- information require 0 (NlogN) time in the worst ease. graphically sorted array, Pos, of the suffixes of A; namely, Under the assumption that all strings of N symbols are Pos [k] is the start position of the kth smallest suffix in the equally likely, the expected length of the longest repeated set tAosA , ...ANvl ) . The sort that produces the array substring is 0 (1ogNl log ] C I) cKGO831. By further Pos is described in the next Section. For now we assume refining our algorithms to take advantage of this fact, we that Pos is given; namely, APos[ol < Apos[l~ < .. c can construct a suffix array and its Icp information in AP,, IN-~,, where “<” denotes the lexicographical order. 0 (N) expected time. For a string u, let ZP be the prefix consisting of the Our approach distills the nature of a suffix tree to first p symbols of u if u contains more than p symbols, its barest essence: A sorted array coupled with another to and u otherwise. We define the relation <,, to be the lexi- accelerate the search. Suffix arrays may be used in lieu of cographical order of p-symbol prefixes; that is, u cP v iff suffix trees in the many applications of this ubiquitous up < vp. We define the relations $, , =P, P, >P, and &, in a structure. Our search and sort approach is distinctly dif- similar way. Note that, for any choice of p, the Pos array ferent and, in theory, provides superior querying time at is also ordered according to $, because u c v implies the expense of somewhat slower construction. Galil u $ v. All suffixes that have equal p-prefixes, for some [Ga85, Problem 91 poses the problem of designing algo- p <N, must appear in consecutive positions in the Pos rithms that are not dependent on 1x1 and our algorithms array, because the Pas array is sorted lexicographically. meet this criterion, i.e., 0 (P +logN) search time with an These facts are central to our search algorithm. 0 (N) space structure, independent of X. In practice, an Suppose that we wish to find all instances of a implementation based on a blend of the ideas in this paper string W=wOwl •.‘w+~ of lengthPIN inA. Let&= compares favorably with an implementation based on min(k: W$Ap,[kl or k=N) and Rw = suffix trees. Our suffix array structure requires only 5N max(k:ApO,~~l~p Work=-I). Since Pm is in $- bytes on a VAX, which is three to five times more space order, it follows that W matches aiUi+l * * * ai+p-r if and efficient than any reasonable suffix tree encoding. Search onlyifi=Pos[k] forsomekc [&,R,]. Thus,if&and times are competitive, but suffix arrays do require three to Rw can be found quickly, then the number of matches is ten times longer to build. For these reasons, we believe Rw-Lr++l and their left endpoints are given by that suffix arrays will become the data structure of choice Pas [Lw], Pas r&+1], ..Pos [Rw]. But Pas is in $-order, for the many applications where the text is very large. In hence a simple binary search can find Lw and Rw using fact, we recently found that the basic concept of suffix 0 (1ogN) comparisons of strings of size at most P; each arrays (sans the lcp and a provable efficient algorithm) such comparison requires 0 (P) single-symbol comparis- has been used in the Oxford English Dictionary (OED) ons. Thus, the Pos array allows us to find all instances of project at the university of Waterloo [Go89]. Suffix a string in A in time 0 (P 1ogN). The algorithm is given arrays have also been used as a basis for a sublinear 320 in Fig. 1. comparisons can be saved when comparing ~~~~~~~to W, bea.se AP,, [L ] =I W =r Amos [R 1 implies Apos[kl =h W for all k in [L, R 1 including M. While this reduces the 1. if W SP ApoS~oIthen number of single-symbol comparisons needed to deter- 2. L,tO 3. else if W >p ApoSINmlIthen mine the $-order of a midpoint with respect to W, it 4. r,tN turns out that the worst case running time is still 5. else 0 (P log N). 6. I 6% RI t NAN--l) To reduce the number of single-symbol comparis- 7. while R -L > 1 do ons to P +rlogz (N-l)1 in the worst case, we use precom- 8. [ M t (L+R)/2 puted information about the Zcps of A~,,IMI with each of 9. if W Ip ApoSIMI then 10. RtM A PCS [L] and Apdj[R]. Consider the set of all triples 11. else (L, M, R) that can arise in the inner loop of the binary 12. LtM search of Fig. 1. There are exactly N -2 such triples, 1 each with a unique midpoint ME 11,N-21, and for each 13.

Load more