Chapter 35
Suffix Arrays: A New Method for On-Line String Searches
Udi Manber* Gene Myers#
* Supported in part by an NSF Presidential Young Investigator Award (grant DCR-8451397), with matching funds from AT&T.
# Supported in part by the NIH (grant R01 LM04960-01).

Abstract

A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that they are three to five times more space efficient. Suffix arrays permit on-line string searches of the type, “Is W a substring of A?” to be answered in time $O(P + \log N)$, where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in $O(N)$ time in the worst case versus $O(N \log N)$ time for suffix arrays. We show, however, that suffix arrays can be constructed in $O(N)$ expected time, regardless of the alphabet size. We believe that suffix arrays will prove to be better in practice than suffix trees for many applications.

1. Introduction

Finding all instances of a string W in a large text A is an important pattern matching problem. There are many applications in which a fixed text is queried many times. In these cases, it is worthwhile to construct a data structure to allow fast queries. Suffix trees are data structures that admit efficient on-line string searches. A suffix tree for a text A of length N over an alphabet $\Sigma$ can be built in $O(N \log |\Sigma|)$ time and $O(N)$ space [Wei73, McC76]. Suffix trees permit on-line string searches of the type, “Is W a substring of A?” to be answered in $O(P \log |\Sigma|)$ time, where P is the length of W. We explicitly consider the dependence of the complexity of the algorithms on $|\Sigma|$, rather than assume that it is a fixed constant, because $\Sigma$ can be quite large for many applications. Suffix trees can also be constructed in time $O(N)$ with $O(P)$ time for a query, but this requires $O(N |\Sigma|)$ space, which renders this method impractical in many applications.

Suffix trees have been studied and used extensively. A survey paper by Apostolico [Apo85] cites over forty references. Suffix trees have been refined from tries to minimum state finite automaton [BBE85], generalized to on-line construction [MR80, BB86], and real-time construction [Sli80], and parallelized [AI86]. Suffix trees have been applied to fundamental string problems such as finding the longest repeated substring [Wei73], finding all squares or repetitions in a string [AP83], computing substring statistics [AP85], approximate string matching [LV86, Mye88], and string comparison [EH86]. They have also been used to address other types of problems such as text compression [RPE81], compressing assembly code [FWM84], inverted indices [Car75], and analyzing genetic sequences [CHM86]. Galil [Ga85] lists a number of open problems concerning suffix trees and on-line string searching.

In this paper, we present a new data structure, called a suffix array, that is basically a sorted list of all the suffixes of A. When coupled with information about the longest common prefixes (lcps) of adjacent elements in the suffix array, string searches can be answered in $O(P + \log N)$ time with a simple augmentation to a classic binary search. The suffix array and associated lcp information occupy a mere 2N integers, and searches are shown to require at most $P + \lceil \log_2 (N-1) \rceil$ single-symbol comparisons. The construction of the suffix array and lcp information requires $O(N \log N)$ time in the worst case. Under the assumption that all strings of N symbols are equally likely, the expected length of the longest repeated substring is $O(\log N / \log |\Sigma|)$ [KGO83]. By further refining our algorithms to take advantage of this fact, we can construct a suffix array and its lcp information in $O(N)$ expected time.

Our approach distills the nature of a suffix tree to its barest essence: A sorted array coupled with another to accelerate the search. Suffix arrays may be used in lieu of suffix trees in the many applications of this ubiquitous structure. Our search and sort approach is distinctly different and, in theory, provides superior querying time at the expense of somewhat slower construction. Galil [Ga85, Problem 9] poses the problem of designing algo- [...]

[...] approximate matching algorithm [ME89].

The paper is organized as follows. In Section 2, we present the search algorithm under the assumption that the suffix array and the lcp information have been computed. In Section 3, we show how to construct the sorted suffix array. In Section 4, we give the algorithm for computing the lcp information. In Section 5, we modify the algorithms to achieve better expected running times. We end with empirical results and comments about practice in Section 6.

2. Searching

Let $A = a_0 a_1 \cdots a_{N-1}$ be a large text of length N. Denote by $A_i = a_i a_{i+1} \cdots a_{N-1}$ the suffix of A that starts at position i. The basis of our data structure is a lexicographically sorted array, Pos, of the suffixes of A; namely, Pos[k] is the start position of the kth smallest suffix in the set $\{A_0, A_1, \ldots, A_{N-1}\}$. The sort that produces the array Pos is described in the next Section. For now we assume that Pos is given; namely, $A_{Pos[0]} < A_{Pos[1]} < \cdots < A_{Pos[N-1]}$, where “<” denotes the lexicographical order.

For a string u, let $u^p$ be the prefix consisting of the first p symbols of u if u contains more than p symbols, and u otherwise. We define the relation $<_p$ to be the lexicographical order of p-symbol prefixes; that is, $u <_p v$ iff $u^p < v^p$. We define the relations $\le_p$, $=_p$, $\ne_p$, $>_p$, and $\ge_p$ in a similar way. Note that, for any choice of p, the Pos array is also ordered according to $\le_p$, because $u < v$ implies $u \le_p v$. All suffixes that have equal p-prefixes, for some p [...]
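As an illustration of the definitions in this section, the following Python sketch builds Pos with a naive comparison sort and answers a query with a plain binary search over p-symbol prefixes. It is a sketch under stated assumptions, not the algorithm of this paper: the function names build_pos_naive and is_substring are hypothetical, the sort is only a stand-in for the construction of Section 3, and the search uses no lcp information, so it needs O(P log N) single-symbol comparisons in the worst case rather than O(P + log N).

```python
def build_pos_naive(text):
    """Return Pos: the start positions of the suffixes of text in
    lexicographic order.

    Assumption: a naive comparison sort on the suffix strings, used
    only to illustrate the definition of Pos.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def is_substring(text, pos, w):
    """Answer "Is w a substring of text?" by binary search over pos.

    Only the p-symbol prefix of each suffix is compared with w, where
    p = len(w). This is valid because pos is also ordered under <=_p.
    """
    p = len(w)
    lo, hi = 0, len(pos)
    # Find the smallest k such that w <=_p A_Pos[k].
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] < w:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(pos) and text[pos[lo]:pos[lo] + p] == w


pos = build_pos_naive("banana")
print(pos)
print(is_substring("banana", pos, "nan"))
```

For the text "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so Pos lists the start positions 5, 3, 1, 0, 4, 2.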
Figure 2: The three cases of the $O(P + \log N)$ search.

[...] either $l$ or $r$. It turns out that both these steps are easy to make:

Case 1: Llcp[M] > $l$ (Fig. 2(a)): in this case, $A_{Pos[M]} =_{l+1} A_{Pos[L]} \ne_{l+1} W$, and so $L_W$ must be in the right half and $l$ is unchanged.

Case 2: Llcp[M] = $l$ (Fig. 2(b)): in this case, we know that the first $l$ symbols of $A_{Pos[M]}$ and W are equal; thus, we need to compare only the $(l+1)$st symbol, $(l+2)$nd symbol, and so on, until we find one, say the $(l+j)$th, such that $W \ne_{l+j} A_{Pos[M]}$. The $(l+j)$th symbol determines whether $L_W$ is in the right or left side. In either case, we also know the new value of $r$ or $l$: it is $l+j$. Since $l = h$ at the beginning of the loop, this step takes $\Delta h + 1$ single-symbol comparisons.

Case 3: Llcp[M] < $l$ (Fig. 2(c)): in this case, since W matched $l$ symbols of L and fewer than $l$ symbols of M, it is clear that $L_W$ is in the left side and that the new value of $r$ is Llcp[M].

Hence, the use of the arrays Llcp and Rlcp (the Rlcp array is used when $l < r$) reduces the number of single-symbol comparisons to no more than $\Delta h + 1$ for each iteration. Summing over all iterations and observing that $\sum \Delta h \le P$, the total number of single-symbol comparisons made in an on-line string search is at most $P + \lceil \log_2 (N-1) \rceil$, and $O(P + \log N)$ time is taken in the worst case. The precise search algorithm is given in Fig. 3.

3. Sorting

The sorting is done in $\lceil \log_2 (N+1) \rceil$ stages. In the first stage, the suffixes are put in buckets according to their first symbol. Then, inductively, each stage further partitions the buckets by sorting according to twice the number of symbols. For simplicity of notation, we number the stages 1, 2, 4, 8, etc., to indicate the number of affected symbols. Thus, in the Hth stage, the suffixes are sorted according to the $\le_H$-order. For simplicity, we pad the suffixes by adding blank symbols, such that the lengths of all of them become N + 1. (This padding is not necessary, but it simplifies the discussion.) The first stage consists of a bucket sort according to the first symbol of each suffix. The suffixes are divided into $m_1$ buckets ($m_1 \le |\Sigma|$), each holding the suffixes with the same first symbol. Assume that after the Hth stage the suffixes are partitioned into $m_H$ buckets, each holding suffixes with the same H first symbols, and that these buckets are sorted according to the $\le_H$-relation. We will show how to sort the elements in each H-bucket to produce the $\le_{2H}$-order in $O(N)$ time. Our sorting algorithm uses similar ideas to those in [KMR72].

Let $A_i$ and $A_j$ be two suffixes belonging to the same bucket after the Hth step; that is, $A_i =_H A_j$. We need to compare $A_i$ and $A_j$ according to the next H symbols. But, the next H symbols of $A_i$ ($A_j$) are exactly the first H symbols of $A_{i+H}$ ($A_{j+H}$). By the assumption, we already know the relative order, according to the [...]
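The doubling idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of this paper: the function name build_pos_doubling is hypothetical, each stage uses a general comparison sort on pairs (H-rank of $A_i$, H-rank of $A_{i+H}$) instead of an $O(N)$ bucket scan, and a rank of -1 stands in for the blank padding.

```python
def build_pos_doubling(text):
    """Return Pos for text by sorting suffixes in stages 1, 2, 4, 8, ...

    Assumption: every stage re-sorts with a comparison sort, so one
    stage costs O(N log N) here rather than O(N).
    """
    n = len(text)
    if n == 0:
        return []
    # Stage 1: the first symbol of a suffix serves as its 1-rank.
    rank = [ord(c) for c in text]
    pos = list(range(n))
    h = 1
    while True:
        # Key for the 2H-order of A_i: its H-rank, then the H-rank of
        # A_{i+H}. A suffix that ends early gets -1, which sorts first
        # (the role played by the blank padding symbol).
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        # Suffixes with equal keys share a 2H-bucket and the same rank.
        new_rank = [0] * n
        for k in range(1, n):
            differs = key(pos[k]) != key(pos[k - 1])
            new_rank[pos[k]] = new_rank[pos[k - 1]] + int(differs)
        rank = new_rank
        if rank[pos[-1]] == n - 1:
            # Every bucket holds one suffix: the order is final.
            return pos
        h *= 2


print(build_pos_doubling("banana"))
```

The sketch stops as soon as every bucket is a singleton, which can happen before all the stages have been run.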
[...] elements². At the start of stage H, Pos[i] contains the start position of the ith smallest suffix, Prm[i] is the inverse of Pos, namely, Prm[Pos[i]] = i, and BH[i] is 1 iff Pos[i] contains the leftmost suffix of an H-bucket (i.e., $A_{Pos[i]} \ne_H A_{Pos[i-1]}$). Count and B2H are temporary arrays; their use will become apparent in the description of a stage of the sort. A radix sort on the first symbol of each suffix is easily tailored to produce Pos, Prm, and BH for stage 1 in $O(N)$ time. Assume that Pos, Prm, and BH have the correct values after stage H, and consider stage 2H.

We first reset Prm[i] to point to the leftmost cell of the H-bucket containing the ith suffix rather than to the suffix's precise place in the bucket. We also initialize Count[i] to 0 for all i. All operations above can be done in $O(N)$ time. We then scan the Pos array in increasing order, one bucket at a time. Let $l$ and $r$ ($l \le r$) mark the left and right boundary of the H-bucket currently being scanned. Let $T_i$ (the H-tail of i) denote Pos[i] - H. For every i, $l \le i \le r$, we increment Count[Prm[$T_i$]], set Prm[$T_i$] = Prm[$T_i$] + Count[Prm[$T_i$]] - 1, and set B2H[Prm[$T_i$]] to 1. In effect, all the suffixes whose [...]

² In fact, two integers are sufficient, and since these integers are always positive we can use their sign bit for the boolean values. Thus, the space [...]

[...] know the lcps between suffixes in adjacent buckets (after the first stage, the lcps between suffixes in adjacent buckets are 0). At stage 2H the buckets are partitioned according to 2H symbols. Thus, the lcps between suffixes in newly adjacent buckets must be at least H and at most 2H-1. Furthermore, if $A_p$ and $A_q$ are in the same H-bucket but are in distinct 2H-buckets, then

$lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$.    (1)

If we can maintain, after stage H, information about all lcps whose values are less than H, then computing lcps in stage 2H will be straightforward from (1). But, this is too much information. We are interested only in computing lcps between adjacent suffixes in the final order, not between every pair of prefixes. Instead of computing all lcps, we maintain an $O(N)$-space data structure that enables us to compute any lcp whose value is less than H in $O(\log N)$ time. We will describe this data structure, which we call an interval tree, after we establish our basic approach.

We define $height(i) = lcp(A_{Pos[i-1]}, A_{Pos[i]})$, [...]

[...] $x \le_H y \le_H z$ imply $lcp(x, z) = \min(lcp(x, y), lcp(y, z))$, it follows, by induction, that if x [...]

[...] $\min(lcp(A_{Pos[k-1]}, A_{Pos[k]}) : k \in [x+1, y])$. Now [...]

[...] the root. This takes $O(\log N)$ time.

$lcp(u, v)$ [...]

[...] $[H, 2H-1]$. We then recursively find the lcp of $A_{p+H}$ and $A_{q+H}$ by finding the nca of suffixes $A_{p+H}$ and $A_{q+H}$ in the history, and so on, until an nca is discovered to be the root of the history. At each successive level of the recursion, the number of the nca is at least halved, and so the number of levels performed is $O(\log L)$, where L is the lcp of $A_p$ and $A_q$. Because the longest repeated substring [...]

The expected sorting time can be reduced to $O(N)$ by modifying the radix sort of the first stage as follows. Let $T = \lfloor \log_{|\Sigma|} N \rfloor$ and consider mapping each string of T symbols over $\Sigma$ to the integer obtained when the string is viewed as a T-digit, radix-$|\Sigma|$ number. This oft-used encoding is an isomorphism onto the range $[0, |\Sigma|^T - 1] \subseteq [0, N-1]$, and the $\le$-relation on the integers is identical with the [...]

6. Practice

A primary motivation for this paper was to be able to efficiently answer on-line string queries for very long genetic sequences (on the order of one million or more symbols long). In practice, it is the space overhead of the query data structure that limits the largest text that may be handled. Suffix trees are quite space expensive, requiring roughly 16 bytes of overhead per text character. Utilizing an appropriate blend of the suffix array algorithms given in this paper, we developed an implementation requiring 5 bytes of overhead per text character whose construction and search speeds are competitive with suffix trees.

There are three distinct ways to implement a data structure for suffix trees, depending on how the outedges of an interior vertex are encoded. Using a $|\Sigma|$-element vector gives a structure requiring $8N + 4(|\Sigma| + 2) \cdot I$ bytes where I is the number of interior nodes in the suffix tree. Encoding each set of outedges with a binary search tree requires $8N + 20I$ bytes. Finally, encoding each outedge set as a linked list requires $8N + 16I$ bytes. The parameter $I < N$ varies as a function of the text. The first four columns of Table 1 illustrate the value of I/N and the per-text-symbol space consumption of each of the three coding schemes. These results suggest that the linked scheme is the most space parsimonious. We developed a tightly coded implementation of this scheme for the timing comparisons with our suffix array software.

For our practical implementation, we chose to build just a suffix array and use the radix-N initial bucket sort described in Section 5 to build it in $O(N)$ expected time. Without the lcp array the search must take $O(P \log N)$ worst-case time. However, keeping variables $l$ and $r$ as suggested in arriving at the $O(P + \log N)$ search significantly improves search speed in practice. We further accelerate the search to $O(P)$ expected time by using a bucket table with $K = \log_{|\Sigma|} N/4$ as described in Section 5. Our search structure thus consists of an N integer suffix array and a N/4 integer bucket array, and so consumes only 5 bytes per text symbol (assuming an integer is 4 bytes).

Table 1 summarizes a number of timing experiments on texts of length 100,000. All times are in seconds and were obtained on a VAX 8650 running UNIX. Columns 6 and 7 give the times for constructing the suffix tree and suffix array, respectively. Columns 8 and 9 give the time to perform 100,000 successful queries of length 20 for the suffix tree and array, respectively. In synopsis, suffix arrays are 3-10 times more expensive to build, 2-5 times more space efficient, and can be queried at speeds comparable to suffix trees.

                     Space (Bytes/text symbol)               Construction Time     Search Time
                     S.Trees                       S.Arrays  S.Trees   S.Arrays    S.Trees   S.Arrays
                     I/N    Link   Tree   Vector
Random (|Σ|=2)       .99    23.8   27.8   19.8     5.0       2.6       7.1         6.0       5.8
Random (|Σ|=4)       .62    17.9   20.4   18.9     5.0       3.1       11.7        5.2       5.6
Random (|Σ|=8)       .45    15.2   17.0   20.8     5.0       4.6       11.4        5.8       6.6
Random (|Σ|=16)      .37    13.9   15.4   30.6     5.0       6.9       11.6        9.2       6.8
Random (|Σ|=32)      .31    13.0   14.2   46.2     5.0       10.9      11.7        10.2      7.0
Text (|Σ|=96)        .54    16.6   18.8   220.0    5.0       5.3       28.3        22.4      9.5
Code (|Σ|=96)        .63    18.1   20.6   255.0    5.0       4.2       35.9        29.3      9.0
DNA (|Σ|=4)          .72    19.5   22.4   25.2     5.0       2.9       18.7        14.6      9.2

Table 1: Empirical results for texts of length 100,000.

References

[Apo85] Apostolico, A., “The myriad virtues of subword trees,” Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI Series F: Computer and System Sciences, Vol. 12, Springer-Verlag (1985), 85-96.
[AI86] Apostolico, A. and C. Iliopoulos, “Parallel log-time construction of suffix trees,” CSD TR 632, Dept. of Computer Science, Purdue University (1986).
[AP83] Apostolico, A. and F.P. Preparata, “Optimal off-line detection of repetitions in a string,” Theoretical Computer Science 22 (1983), 297-315.
[AP85] Apostolico, A. and F.P. Preparata, “Structural properties of the string statistics problem,” Journal of Computer and System Science 31 (1985), 394-411.
[BB86] Blumer, J. and A. Blumer, “On-line construction of a complete inverted file,” UCSC-CRL-86-11, Dept. of Computer Science, University of Colorado (1986).
[BBE85] Blumer, J., Blumer, A., Ehrenfeucht, E., Haussler, D., Chen, M.T., and J. Seiferas, “The smallest automaton recognizing the subwords of a text,” Theoretical Computer Science (1985).
[Car75] Cardenas, A.F., “Analysis and performance of inverted data base structures,” Comm. of the ACM 18, 5 (1975), 253-263.
[CHM86] Clift, B., Haussler, D., McConnell, R., Schneider, T.D., and G.D. Stormo, “Sequence landscapes,” Nucleic Acids Research 4, 1 (1986), 141-158.
[EH86] Ehrenfeucht, A. and D. Haussler, “A new distance metric on strings computable in linear time,” UCSC-CRL-86-27, Dept. of Computer Science, University of Colorado (1986).
[FWM84] Fraser, C., Wendt, A., and E.W. Myers, “Analyzing and compressing assembly code,” Proceedings of the SIGPLAN Symposium on Compiler Construction (1984), 117-121.
[Ga85] Galil, Z., “Open problems in stringology,” Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI Series F: Computer and System Sciences, Vol. 12, Springer-Verlag (1985), 1-8.
[Go89] Gonnet, G., Private communication.
[HT84] Harel, D. and R.E. Tarjan, “Fast algorithms for finding nearest common ancestors,” SIAM Journal on Computing 13 (1984), 338-355.
[KGO83] Karlin, S., Ghandour, G., Ost, F., Tavare, S., and L.J. Korn, “New approaches for computer analysis of nucleic acid sequences,” Proc. Natl. Acad. Sci. USA 80 (September 1983), 5660-5664.
[KMR72] Karp, R.M., Miller, R.E., and A.L. Rosenberg, “Rapid identification of repeated patterns in strings, trees and arrays,” Fourth Annual ACM Symposium on Theory of Computing (May 1972), 125-136.
[LV86] Landau, G.M. and U. Vishkin, “Introducing efficient parallelism into approximate string matching,” Proc. 18th ACM Symposium on Theory of Computing (1986), 220-230.
[McC76] McCreight, E.M., “A space-economical suffix tree construction algorithm,” Journal of the ACM 23 (1976), 262-272.
[MR80] Majster, M.E. and A. Reiser, “Efficient on-line construction and correction of position trees,” SIAM Journal on Computing 9, 4 (1980), 785-807.
[Mye88] Myers, E.W., “Incremental alignment algorithms and their applications,” SIAM Journal on Computing, accepted for publication.
[ME89] Myers, E.W. and A. Ehrenfeucht, “A sublinear algorithm for similarity searching,” in preparation.
[RPE81] Rodeh, M., Pratt, V.R., and S. Even, “Linear algorithm for data compression via string matching,” Journal of the ACM 28, 1 (1981), 16-24.
[Sli80] Slisenko, A.O., “Detection of periodicities and string-matching in real time,” Journal of Soviet Mathematics 22, 3 (1983), 1316-1387; translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V.A. Steklova AN SSSR 105 (1980), 62-173.
[SV88] Schieber, B. and U. Vishkin, “On finding lowest common ancestors: Simplification and parallelization,” SIAM Journal on Computing 17 (December 1988), 1253-1262.
[Wei73] Weiner, P., “Linear pattern matching algorithm,” Proc. 14th IEEE Symposium on Switching and Automata Theory (1973), 1-11.