review articles

40 Years of Suffix Trees

Tracing the first four decades in the life of suffix trees, their many incarnations, and their applications.

BY ALBERTO APOSTOLICO, MAXIME CROCHEMORE, MARTIN FARACH-COLTON, ZVI GALIL, AND S. MUTHUKRISHNAN

DOI:10.1145/2810036

We dedicate this article to our friend and colleague, Alberto Apostolico (1948–2015), who passed away on July 20. He was a major figure in the development of algorithms on strings.

WHEN WILLIAM LEGRAND finally decrypted the string, it did not seem to make much more sense than it did before.

53‡‡‡305))6*,48264‡.)4z);806”,48†8P60))85;1‡
(;:‡*8†83(88)5*†,46(;88*96*?;8)* ‡ (;485);5*†2:* ‡
(;4956*2(5*Ñ4)8P8*;4069285);)6‡8)4‡‡;1(‡9;48081;8:
8‡1;4885;4)485†528806*81(‡9;48;(88;4(‡?34;
48)4‡;161;:188; ‡?;

The decoded message read: “A good glass in the bishop’s hostel in the devil’s seat forty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side shoot from the left eye of the death’s-head a bee line from the tree through the shot fifty feet out.” But at least it did sound more like natural language, and eventually guided the main character of Edgar Allan Poe’s “The Gold-Bug” [36] to discover the treasure he had been after. Legrand solved a substitution cipher using symbol frequencies. He first looked for the most frequent symbol and changed it into the most frequent letter of English, then similarly inferred the most frequent word, then punctuation marks, and so on.

Both before and after 1843, the natural impulse when faced with some mysterious message has been to count frequencies of individual tokens or subassemblies in search of a clue. Perhaps one of the most intense and fascinating subjects for this kind of scrutiny has been biosequences. As soon as some such sequences became available, statistical analysts tried to link characters or blocks of characters to relevant biological functions. With the early examples of whole genomes emerging in the mid-1990s, it seemed natural to count the occurrences of all blocks of size 1, 2, and so on, up to any desired length, looking for statistical characterizations of coding regions, promoter regions, and others.

This article is not about cryptography. It is about a data structure and its variants, and the many surprising and useful features it carries. Among these is the fact that setting up a statistical table of occurrences for all substrings (also called factors), of any length, of a text string of n characters takes only time and space linear in the length of the text string. While nobody would be so foolish as to solve the problem by first generating all exponentially many possible strings and then counting their occurrences one by one, a text string may still contain Θ(n²) distinct substrings, so that tabulating all of them in linear space, never mind linear time, already seems puzzling.
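As a point of reference for the counting problem described in the introduction, the following is a minimal brute-force sketch in Python. It is not the linear-time method the article is about: it simply enumerates every substring, which makes the quadratic number of distinct substrings visible on small inputs. The function names `substring_counts` and `distinct_substrings` are illustrative choices, not taken from the article.

```python
from collections import Counter


def substring_counts(text):
    """Tabulate the number of occurrences of every non-empty substring.

    Brute force: enumerates all n(n+1)/2 (start, end) pairs, so it is
    only meant for short strings.
    """
    counts = Counter()
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            counts[text[start:end]] += 1
    return counts


def distinct_substrings(text):
    """Return how many distinct non-empty substrings the text contains."""
    return len(substring_counts(text))


if __name__ == "__main__":
    table = substring_counts("abcabcaba")
    print(table["abc"], table["ab"], table["a"])
    # A string of n distinct symbols has n(n+1)/2 distinct substrings,
    # while a run of one repeated symbol has only n.
    print(distinct_substrings("abcd"), distinct_substrings("aaaa"))
```

The table built this way can hold a quadratic number of entries, which is exactly the cost that the structures surveyed in this article avoid.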

COMMUNICATIONS OF THE ACM | APRIL 2016 | VOL. 59 | NO. 4

Over the years, such structures have held center stage in text searching, indexing, statistics, and compression as well as in the assembly, alignment, and comparison of biosequences. Their range of scope extends to areas as diverse as detecting plagiarism, finding surprising substrings in a text, testing the unique decipherability of a code, and more. Their impact on computer science and IT at large cannot be overstated. Text searching and bioinformatics would not be the same without them. In 2013, the Combinatorial Pattern Matching symposium celebrated the 40th anniversary of the appearance of Weiner’s invention of the suffix tree [41] with a special session entirely dedicated to that event.

History Bits and Pieces

At the dawn of “stringology,” Donald Knuth conjectured that the problem of finding the longest substring common to two long text sequences of total length n required Ω(n log n) time. An O(n log n)-time algorithm had been provided by Karp, Miller, and Rosenberg [26]. That construction was destined to play a role in parallel pattern matching, but Knuth’s conjecture was short lived: in 1973, Peter Weiner showed the problem admitted an elegant linear-time solution [41], as long as the alphabet of the string was fixed. Such a solution was actually a byproduct of a construction he had originally set up for a different purpose, that is, identifying any substring of a text file without specifying all of them. In doing so, Weiner introduced the notion of a textual inverted index that would elicit refinements, analyses, and applications for 40 years and counting, a feature hardly shared by any other data structure.

Weiner’s original construction processed the text file from right to left. As each new character was read in, the structure, which he called a “bi-tree,” would be updated to accommodate longer and longer suffixes of the text file. Thus, this was an inherently offline construction, since the text had to be known in its entirety before the construction could begin. Alternatively, one could say the algorithm would build the structure for the reverse of the text online. About three years later, Ed McCreight provided a left-to-right algorithm and changed the name of the structure to “suffix tree” [32], a name that would stick.

Figure 1. The expanded suffix tree of the string x = abcabcaba.

Figure 2. Building an expanded suffix tree by insertion of consecutive suffixes (showing here the insertion of abcaba$).

The insertion of suffix suf_i (i = 1, 2, …, n) consists of two phases. In the first phase, we search for suf_i in T_{i−1}. Note the presence of $ guarantees that every suffix will end in a distinct leaf. Therefore, this search will end with failure sooner or later. At that point, we will have identified the longest prefix of suf_i that has a locus (that is, a terminal node) in T_{i−1}. Let head_i (abcab in the example) be this prefix and α the locus of head_i. We can write suf_i = head_i ∙ tail_i with tail_i (a$ in the example) nonempty. In the second phase, we need to add to T_{i−1} a path leaving node α and labeled tail_i. This achieves the transformation of T_{i−1} into T_i.

Let x be a string of n − 1 symbols over some alphabet Σ and $ an extra character not in Σ. The expanded suffix tree T_x associated with x is a digital search tree collecting all suffixes of x$. Specifically, T_x is defined as follows.


1. T_x has n leaves, labeled from 1 to n.
2. Each arc is labeled with a symbol of Σ ∪ {$}. For any i, 1 ≤ i ≤ n, the concatenation of the labels on the path from the root of T_x to leaf i is precisely the suffix suf_i = x_i x_{i+1} … x_{n−1} $.
3. For any two suffixes suf_i and suf_j of x$, if w_ij is the longest common prefix that suf_i and suf_j have in common, then the path in T_x relative to w_ij is the same for suf_i and suf_j.

An example of expanded suffix tree is given in Figure 1.

The tree can be interpreted as the state transition diagram of a deterministic finite automaton where all nodes and leaves are final states, the root is the initial state, and the labeled arcs, which are assumed to point downward, represent part of the state-transition function. The state transitions not specified in the diagram lead to a unique non-final sink state. Our automaton recognizes the (finite) language consisting of all substrings of string x. This observation also clarifies how the tree can be used in an online search: letting y be the pattern, we follow the downward path in the tree in response to consecutive symbols of y, one symbol at a time. Clearly, y occurs in x if and only if this process leads to a final state.

In terms of T_x, we say the locus of a string y is the node α, if it exists, such that the path from the root of T_x to α is labeled y.

An algorithm for the direct construction of the expanded T_x (often called suffix trie) is readily derived (see Figure 2). We start with an empty tree and add to it the suffixes of x$ one at a time. This procedure takes Θ(n²) time and O(n²) space; however, it is easy to reduce space to O(n), thereby producing a suffix tree in compact form (Figure 3). Once this is done, it becomes possible to aim for an expectedly non-trivial O(n) time construction.

Figure 3. A suffix tree in compact form.

This is obtained by first collapsing every chain formed by nodes with only one child into a single arc. The resulting compact version of T_x has at most n − 1 internal nodes, since there are n leaves in total and every internal node is branching. The label of the generic arc is now a substring, rather than a symbol, of x$. However, arc labels can be expressed by suitable pairs of pointers to a common copy of x$, thus achieving an O(n) space bound overall.

At the CPM Conference of 2013, McCreight revealed his O(n) time construction was not born as an alternative to Weiner’s: he had developed it in an effort to understand Weiner’s paper, but when he showed it to Weiner, asking him to confirm he had understood that paper, the answer was “No, but you have come up with an entirely different and elegant construction!” In unpublished lecture notes of 1975, Vaughan Pratt displayed the duality of this structure and Weiner’s “repetition finder” [37].

McCreight’s algorithm was still inherently offline, and it immediately triggered a search for an online version. Some partial attempts at an online algorithm were made, but such a variant had to wait almost two decades for Esko Ukkonen’s paper in 1995 [39]. In all these linear-time constructions, linearity was based on the assumption of a finite alphabet; without that assumption they took Θ(n log n) time. In 1997, Martin Farach introduced an algorithm that abandoned the one-suffix-at-a-time approach prevalent until then; this algorithm gives a linear-time reduction from suffix-tree construction to character sorting, and thus is optimal for all alphabets [17]. In particular, it runs in linear time for a larger class of alphabets, for example, when the alphabet size is polynomial in input length.

key insights
- The suffix tree is the core data structure in string analysis.
- It has a rich history, with connections to compression, matching, automata, data structures and more.
- There are powerful techniques to build suffix trees and use them efficiently in many applications.

Around 1984, Blumer et al. [9] and Crochemore [14] exposed the surprising result that the smallest finite automaton recognizing all and only the suffixes of a string of n characters has only O(n) states and edges. Initially coined a directed acyclic word graph (DAWG), it can even be further reduced if all the states are terminal states [14]. It then accepts all substrings of the string and is called the factor/substring automaton. There is a nice relation between the index data structures when the string has no end-marker and its suffixes are marked with terminal states in the tree. Then, the suffix tree is the edge-compacted version of the tree, and its number of nodes can be minimized as with any automaton, thereby providing the compact DAWG of the string. Permuting the two operations, compaction and minimization, leads to the same structure. Apparently Anatoli Slissenko (see the appendix available with this article in the ACM Digital Library under Source Material) ended up with a similar structure for his work on the detection of repetitions in strings. These automata provide another, more efficient counterexample to Knuth’s conjecture when they are used, against the grain, as pattern-matching machines (see Figure 4).

The appearance of suffix trees dovetailed with some interesting and independent developments in information theory. In his famous approach to the notion of information, Kolmogorov equated the information or structure in a string to the length of the shortest program that would be needed to produce that string by a Universal Turing Machine. The unfortunate thing is this measure is not computable and even if it were, most long strings are incompressible (that is, lack a short program producing them), since there are increasingly many long strings and comparatively much fewer short programs (themselves strings).

The regularities exploited by Kolmogorov’s universal and omniscient machine could be of any conceivable kind, but what if one limited them to the syntactic redundancies affecting a text in the form of repeated substrings? If a string is repeated many times one could profitably encode all occurrences by a pointer to a common copy. This copy could be internal or external to the text. In the former case one could have pointers going in both directions or only in one direction, allow or forbid nesting of pointers, and so on. In his doctoral thesis, Jim Storer showed that virtually all such “macro schemes” are intractable, except one. Not long before that, in a landmark paper entitled “On the Complexity of Finite Sequences” [30], Abraham Lempel and Jacob Ziv had proposed a variable-to-block encoding, based on a simple parsing of the text, with the feature that the compression achieved would match, in the limit, that produced by a compressor tailored to the source probabilities. Thus, by a remarkable alignment of stars, the compression method brought about by Lempel and Ziv was not only optimal in the information theoretic sense, but it found an optimal, linear-time implementation by the suffix tree, as was detailed immediately by Michael Rodeh, Vaughan Pratt, and Shimon Even [38].

In his original paper, Weiner listed a few applications of his “bi-tree” including most notably offline string searching: preprocessing a text file to support queries that return the occurrences of a given pattern in time linear in the length of the pattern. And of course, the “bi-tree” addressed Knuth’s conjecture, by showing how to find the longest substring common to two files in linear time for a finite alphabet. There followed unpublished notes by Pratt entitled “Improvements and Applications for the Weiner Repetition Finder” [37]. A decade later, Alberto Apostolico would list more applications in a paper entitled “The Myriad Virtues of Suffix Trees,” [2]

Figure 4. The compact suffix tree (left) and the suffix automaton (right) of the string “bananas.”

Failure links are represented by the dashed arrows. Although it is an index on the string, the same automaton can be used as a pattern-matching machine to locate substrings of “bananas” in another text or to compute their longest common substring. The process runs online on the second string. Assume, for example, that “bana” has just been scanned from the second string and the current state of the automaton is state 4. If the next letter is “n,” the common substring is “banan” of length 5 and the new state is 5. If the next letter is “s,” the failure link is used: from state 3′, corresponding to the common substring “ana” of length 3, we get the common substring “anas” with the new state 7. If the next letter is “b,” iterating the failure link leads to state 0 and we get the common substring “b” with the new state 1. Finally, any other next letter will produce the empty common substring and state 0.
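The matching process described for Figure 4 can be sketched in Python as follows. This is a minimal illustration using the standard online construction of the suffix automaton, not code from the article; the names `SuffixAutomaton` and `longest_common_substring` are hypothetical. States are numbered in order of creation, so the indices differ from the labels in the figure (the primed states of the figure correspond to the cloned states here).

```python
class SuffixAutomaton:
    """Suffix automaton of a string, built online one character at a time."""

    def __init__(self, text):
        self.next = [{}]     # outgoing transitions of each state
        self.link = [-1]     # failure (suffix) link of each state
        self.length = [0]    # length of the longest string reaching each state
        last = 0
        for ch in text:
            last = self._extend(last, ch)

    def _new_state(self, transitions, length, link):
        self.next.append(transitions)
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def _extend(self, last, ch):
        cur = self._new_state({}, self.length[last] + 1, 0)
        p = last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            return cur                      # link[cur] stays 0 (initial state)
        q = self.next[p][ch]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        # Split q: the clone takes over the shorter strings that reached q.
        clone = self._new_state(dict(self.next[q]),
                                self.length[p] + 1, self.link[q])
        while p != -1 and self.next[p].get(ch) == q:
            self.next[p][ch] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur


def longest_common_substring(indexed, other):
    """Scan `other` online through the automaton of `indexed`."""
    sa = SuffixAutomaton(indexed)
    state, cur_len = 0, 0
    best_len, best_end = 0, 0
    for pos, ch in enumerate(other):
        # Follow failure links until the current match can be extended.
        while state != 0 and ch not in sa.next[state]:
            state = sa.link[state]
            cur_len = sa.length[state]
        if ch in sa.next[state]:
            state = sa.next[state][ch]
            cur_len += 1
        else:
            cur_len = 0                     # back at the initial state
        if cur_len > best_len:
            best_len, best_end = cur_len, pos + 1
    return other[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_substring("bananas", "xbanas"))
    print(longest_common_substring("bananas", "ananas!"))
```

For “bananas” this construction yields 11 states, matching the eight unprimed and three primed states of the figure. When several common substrings share the maximal length, the sketch returns the first one encountered in the second string.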


and two decades later suffix trees and companion structures with their applications gave rise to several chapters in reference books by Crochemore and Rytter, Dan Gusfield, and Crochemore, Hancart, and Lecroq (see the appendix available with this article in the ACM Digital Library).

The space required by suffix trees has been a nuisance in applications where they were needed the most. With genomes on the order of gigabytes, for instance, the space difference between 20 times larger than the source versus, say, only 11 times larger, can be substantial. For a few lustra, Stefan Kurtz and his co-workers devoted their effort to cleverly allocating the tree and some of its companion structures [28]. In 2001, David R. Clark and J. Ian Munro proposed one of the best space-saving methods on secondary storage [13]. Clark and Munro’s “succinct suffix tree” sought to preserve as much of the structure of the suffix tree as possible. Udi Manber and Eugene W. Myers took a different approach, however. In 1990, they introduced the “suffix array” [31], which eliminated most of the structure of the suffix tree, but was still able to implement many of the same operations, requiring space equal to 2 integers per text character and searching in time O(|P| + log n) (reducible to 1 integer by accepting search time O(|P| log n)). The suffix array stores the suffixes of the input in lexicographic order and can be seen as the sequence of leaves’ labels as found in the suffix tree by a preorder traversal that would expand each node according to the lexicographic order.

Although the suffix array seemed at first to be a different data structure than the suffix tree, the distinction has receded. For example, Manber and Myers’s original construction of the suffix array took O(n log n) time for any alphabet, but the suffix array could be constructed in linear time from the suffix tree for any alphabet. In 2001, Toru Kasai et al. [27] showed the suffix tree could be constructed in linear time from the suffix array. Therefore, the suffix array was shown to be a succinct representation of the suffix tree. In 2003, three groups presented three different modifications of Farach’s algorithm for suffix tree construction to give the first linear-time algorithms for directly constructing the suffix array; that is, the first linear-time algorithms for computing suffix arrays that did not first compute the full suffix tree. Since then, there have been many algorithms for fast construction of suffix arrays, notably that of Nong, Zhang, and Chan [35], which is linear time and fast in practice. With fast construction algorithms and small space required, the suffix array is the suffix-tree variant that has gained the most widespread adoption in software systems. A more recent succinct suffix tree and array, which take O(n) bits to represent for a binary alphabet (O(n log σ) bits otherwise), was presented by Grossi and Vitter [21].

Actually, the histories of suffix trees and compression are tightly intertwined. This should not come as a surprise, since the redundancies that pattern discovery tries to unearth are ideal candidates to be removed for purposes of compression. In 1994, M. Burrows and D.J. Wheeler proposed a breakthrough compression method based on suffix sorting [11]. Circa 1995, Amihood Amir, Gary Benson, and Martin Farach posed the problem of searching in compressed texts [1]. In 2000, Paolo Ferragina and Giovanni Manzini introduced the FM-index, a compressed suffix array based on the Burrows-Wheeler transform [19]. This structure, which may be smaller than the source file, supports searching without decompression. This was extended to compressed tree indexing problems in Ferragina et al. [18] using a modification of the Burrows-Wheeler transform.

Fallout, Extensions, and Challenges

As highlighted at the outset, there has been hardly any application of text processing that did not need these indexes at one point or another. A prominent case has been searching with errors, a problem first efficiently tackled in 1985 by Gad Landau in his Ph.D. thesis [29]. In this kind of search, one looks for substrings of the text that differ from the pattern in a limited number of errors such as a single character deletion, insertion, or substitution. To efficiently solve this problem, Landau combined suffix trees with a clever solution to the so-called lowest common ancestor (LCA) problem. The LCA problem assumes a rooted tree is given and then it seeks, for any pair of nodes, the lowest node in the tree that is an ancestor of both [23]. Following a linear-time preprocessing of the tree, any LCA query can be answered in constant time. Landau used LCA queries on suffix trees to perform constant-time jumps over segments of the text that would be guaranteed to match the pattern. When k errors are allowed, the search for an occurrence at any given position can be abandoned after k such jumps. This leads to an algorithm that searches for a pattern with k errors in a text of n characters in O(nk) steps.

Among the basic primitives supported by suffix trees and arrays, one finds, of course, the already mentioned search for a pattern in a text in time proportional to the length of the pattern rather than the text. In fact, it is even possible to enumerate occurrences in time proportional to their number and, with trivial preprocessing of the tree, tell the total number of occurrences for any query pattern in time proportional to the pattern size. The problem of finding the longest substring appearing twice in a text or shared between two files has been noted previously: this is probably where it all started. A germane problem is that of detecting squares, repetitions, and maximal periodicities in a text, a problem rooted in work by Axel Thue dated more than a century ago, with multiple contemporary applications in compression and DNA analysis. A square is a pattern consisting of two consecutive occurrences of the same string. Suffix trees have been used to detect in optimal O(n log n) time all squares (or repetitions) in a text, each with its set of starting positions [5], and later to find and store all distinct square substrings in a text in linear time. Squares play a role in an augmentation of the suffix tree suitable to report, for any query pattern, the number of its non-overlapping occurrences [6, 10].

There are multiple uses of suffix trees in setting up some kind of signature for text strings, as well as measures of similarity or difference. Among the latter, there is the problem of computing the forbidden or absent words of a text, which are minimal strings that do not appear in the text (while all their proper substrings do) [8, 15]. Such words lead to, among other things, an original approach to text compression [16]. Once regarded as the succinct representation of the “bag-of-words” of a text, suffix trees can be used to assess the similarity of two text files, thereby supporting clustering, document classification, and even phylogeny [4, 12, 40]. Intuitively, this is done by assessing how much the trees for the two input sequences have in common. Suitably enriched with the probability of the substring ending at each node, a tree can be used to detect surprisingly over-represented substrings of any length [3], for example, in the quest for promoter regions in biosequences.

The suffix tree of the concatenation of, say, k ≥ 2 text files supports efficient solutions to problems arising in domains ranging from plagiarism detection to motif discovery in biosequences. The need for k distinct end-markers poses some subtleties in maintaining linear time, for which the reader is referred to Gusfield [22]. In its original form, the problem of indexing multiple texts was called the “color problem” and seeks to report, for any given query string and in time linear in the query, how many documents out of the total of k contain at least one occurrence of the query. A simple and elegant solution was given in 1992 by Lucas C.K. Hui [25]. Recently, the combined suffix tree of many strings (also known as the generalized suffix tree) was used to solve a variety of document listing problems. Here, a set of text documents is preprocessed as a combined suffix tree. The problem is to return the list of all documents that contain a query pattern in time proportional to the number of such documents, not to the total number of occurrences (occ), which can be significantly larger. This problem was solved in Muthukrishnan [33] by reducing it to range minimum queries. This basic document-listing problem has since been extended to many other problems including listing the top-k in various string and information distances. For example, in Hon

et al. [24], the structure of the generalized suffix tree is crucially used to design a linear machine-word data structure to return the top-k most frequent documents containing a pattern p in time nearly linear in pattern size.

One surprising variant of the suffix tree was introduced by Brenda Baker for purposes of detection of plagiarism in student reports as well as optimization in software development [7]. This variant of pattern matching, called “parameterized matching,” enables one to find program segments that are identical up to a systematic change of parameters, or substrings that are identical up to a systematic relabeling or permutation of the characters in the alphabet. One obvious extension of the notion of a suffix tree is to more than one dimension, albeit the mechanics of the extension itself are far from obvious [34]. Among more distant relatives, one finds “wavelet trees.” Originally proposed as a representation of compressed suffix arrays [20], wavelet trees enable one to perform on general alphabets the ranking and selection primitives previously limited to bit vectors, and more.

The list could go on and on, but the scope of this article was not meant to be exhaustive. Actually, after 40 years of unrelenting developments, it is fair to assume the list will continue to grow. Open problems also abound. For instance, many of the observed sequences are expressed in numbers rather than characters, and in both cases are affected by various types of errors. While the outcome of a two-character comparison is just one bit, two numbers can be more or less close, depending on their difference or some other metric. Likewise, two text strings can be more or less similar, depending on the number of elementary steps necessary to change one into the other. The most disruptive aspect of this framework is the loss of the transitivity property that leads to the most efficient exact string matching solutions. And yet indexes capable of supporting fast and elegant approximate pattern queries of the kind just highlighted would be immensely useful. Hopefully, they will come up soon and, in time, have their own 40th-anniversary celebration.

Acknowledgments. We are grateful to Ed McCreight, Ronnie Martin, Vaughan Pratt, Peter Weiner, and Jacob Ziv for discussions and help. We are indebted to the referees for their careful scrutiny of an earlier version of this article, which led to many improvements.

References
1. Amir, A., Benson, G. and Farach, M. Let sleeping files lie: Pattern matching in Z-compressed files. In Proceedings of the 5th ACM-SIAM Annual Symposium on Discrete Algorithms (Arlington, VA, 1994), 705–714.
2. Apostolico, A. The myriad virtues of suffix trees. Combinatorial Algorithms on Words, vol. 12 of NATO Advanced Science Institutes, Series F. A. Apostolico and Z. Galil, Eds. Springer-Verlag, Berlin, 1985, 85–96.
3. Apostolico, A., Bock, M.E. and Lonardi, S. Monotony of surprise and large-scale quest for unusual words. J. Computational Biology 10, 3/4 (2003), 283–311.
4. Apostolico, A., Denas, O. and Dress, A. Efficient tools for comparative substring analysis. J. Biotechnology 149, 3 (2010), 120–126.
5. Apostolico, A. and Preparata, F.P. Optimal off-line detection of repetitions in a string. Theor. Comput. Sci. 22, 3 (1983), 297–315.
6. Apostolico, A. and Preparata, F.P. Data structures and algorithms for the strings statistics problem. Algorithmica 15, 5 (May 1996), 481–494.
7. Baker, B.S. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput. 26, 5 (1997), 1343–1362.
8. Béal, M.-P., Mignosi, F. and Restivo, A. Minimal forbidden words and symbolic dynamics. In Proceedings of the 13th Annual Symposium on Theoretical Aspects of Computer Science, vol. 1046 of Lecture Notes in Computer Science (Grenoble, France, Feb. 22–24, 1996). Springer, 555–566.
9. Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., Chen, M.T. and Seiferas, J. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 1 (1985), 31–55.
10. Brodal, G.S., Lyngsø, R.B., Östlin, A. and Pedersen, C.N.S. Solving the string statistics problem in time O(n log n). In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, vol. 2380 of Lecture Notes in Computer Science (Malaga, Spain, July 8–13, 2002). Springer, 728–739.
11. Burrows, M. and Wheeler, D.J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corp., May 1994.
12. Chairungsee, S. and Crochemore, M. Using minimal absent words to build phylogeny. Theoretical Computer Science 450, 1 (2012), 109–116.
13. Clark, D.R. and Munro, J.I. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Annual Symposium on Discrete Algorithms (Atlanta, GA, 1996), 383–391.
14. Crochemore, M. Transducers and repetitions. Theor. Comput. Sci. 45, 1 (1986), 63–86.
15. Crochemore, M., Mignosi, F. and Restivo, A. Automata and forbidden words. Information Processing Letters 67, 3 (1998), 111–117.
16. Crochemore, M., Mignosi, F., Restivo, A. and Salemi, S. Data compression using antidictionaries. In Proceedings of the IEEE: Special Issue Lossless Data Compression 88, 11 (2000). J. Storer, Ed., 1756–1768.
17. Farach, M. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th IEEE Annual Symposium on Foundations of Computer Science (Miami Beach, FL, 1997), 137–143.
18. Ferragina, P., Luccio, F., Manzini, G. and Muthukrishnan, S. Compressing and indexing labeled trees with applications. JACM 57, 1 (2009).
19. Ferragina, P. and Manzini, G. Opportunistic data structures with applications. In FOCS (2000), 390–398.
20. Grossi, R., Gupta, A. and Vitter, J.S. High-order entropy-compressed text indexes. In SODA (2003), 841–850.
21. Grossi, R. and Vitter, J.S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings ACM Symposium on the Theory of Computing (Portland, OR, 2000). ACM Press, 397–406.
22. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, U.K., 1997.
23. Harel, D. and Tarjan, R.E. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2 (1984), 338–355.
24. Hon, W.-K., Shah, R. and Vitter, J.S. Space-efficient framework for top-k string retrieval problems. In FOCS. IEEE Computer Society, 2009, 713–722.
25. Hui, L.C.K. Color set size problem with applications to string matching. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, no. 644 in Lecture Notes in Computer Science (Tucson, AZ, 1992). A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, Eds. Springer-Verlag, Berlin, 230–243.
26. Karp, R.M., Miller, R.E. and Rosenberg, A.L. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the 4th ACM Symposium on the Theory of Computing (Denver, CO, 1972). ACM Press, 125–13.
27. Kasai, T., Lee, G., Arimura, H., Arikawa, S. and Park, K. Linear-time longest-common-prefix computation in suffix arrays and its applications. CPM. Springer-Verlag, 2001, 181–192.
28. Kurtz, S. Reducing the space requirements of suffix trees. Softw. Pract. Exp. 29, 13 (1999), 1149–1171.
29. Landau, G.M. String matching in erroneous input. Ph.D. Thesis, Department of Computer Science, Tel-Aviv University, 1986.
30. Lempel, A. and Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 22 (1976), 75–81.
31. Manber, U. and Myers, G. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st ACM-SIAM Annual Symposium on Discrete Algorithms (San Francisco, CA, 1990), 319–327.
32. McCreight, E.M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (1976), 262–272.
33. Muthukrishnan, S. Efficient algorithms for document listing problems. In Proceedings of the 13th ACM-SIAM Annual Symposium on Discrete Algorithms (2002), 657–666.
34. Na, J.C., Ferragina, P., Giancarlo, R. and Park, K. Two-dimensional pattern indexing. In Encyclopedia of Algorithms. 2008.
35. Nong, G., Zhang, S. and Chan, W.H. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60, 10 (2011), 1471–1484.
36. Poe, E.A. The Gold-Bug and Other Tales. Dover Thrift Editions Series. Dover, 1991.
37. Pratt, V. Improvements and applications for the Weiner repetition finder. Manuscript, 1975.
38. Rodeh, M., Pratt, V. and Even, S. Linear algorithm for data compression via string matching. J. Assoc. Comput. Mach. 28, 1 (1981), 16–24.
39. Ukkonen, E. On-line construction of suffix trees. Algorithmica 14, 3 (1995), 249–260.
40. Ulitsky, I., Burstein, D., Tuller, T. and Chor, B. The average common substring approach to phylogenomic reconstruction. J. Computational Biology 13, 2 (2006), 336–350.
41. Weiner, P. Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory (Washington, D.C., 1973), 1–11.

Alberto Apostolico held joint appointments with Georgia Tech’s School of Computational Science and Engineering and School of Interactive Computing as a professor and a researcher. He passed away on July 20, 2015.

Maxime Crochemore is a professor at King’s College London and Université Paris-Est, France.

Martin Farach-Colton is a professor in the Department of Computer Science at Rutgers University, Piscataway, NJ.

Zvi Galil is Dean of the College of Computing at Georgia Institute of Technology, Atlanta, GA.

S. Muthukrishnan is a professor in the Department of Computer Science at Rutgers University, Piscataway, NJ.

Copyright held by authors. Publication rights licensed to ACM. $15.00.
