40 Years of Suffix Trees
review articles
DOI:10.1145/2810036
COMMUNICATIONS OF THE ACM | APRIL 2016 | VOL. 59 | NO. 4

Tracing the first four decades in the life of suffix trees, their many incarnations, and their applications.

BY ALBERTO APOSTOLICO, MAXIME CROCHEMORE, MARTIN FARACH-COLTON, ZVI GALIL, AND S. MUTHUKRISHNAN

WHEN WILLIAM LEGRAND finally decrypted the string, it did not seem to make much more sense than it did before.

53‡‡‡305))6*,48264‡.)4z);806”,48†8P60))85;1‡
(;:‡*8†83(88)5*†,46(;88*96*?;8)* ‡ (;485);5*†2:* ‡
(;4956*2(5*Ñ4)8P8*;4069285);)6‡8)4‡‡;1(‡9;48081;8:
8‡1;4885;4)485†528806*81(‡9;48;(88;4(‡?34;
48)4‡;161;:188; ‡?;

The decoded message read: “A good glass in the bishop’s hostel in the devil’s seat forty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side shoot from the left eye of the death’s-head a bee line from the tree through the shot fifty feet out.” But at least it did sound more like natural language, and eventually guided the main character of Edgar Allan Poe’s “The Gold-Bug” [36] to discover the treasure he had been after. Legrand solved a substitution cipher using symbol frequencies. He first looked for the most frequent symbol and changed it into the most frequent letter of English, then similarly inferred the most frequent word, then punctuation marks, and so on.

Both before and after 1843, the natural impulse when faced with some mysterious message has been to count frequencies of individual tokens or subassemblies in search of a clue. Perhaps one of the most intense and fascinating subjects for this kind of scrutiny has been biosequences. As soon as some such sequences became available, statistical analysts tried to link characters or blocks of characters to relevant biological functions. With the early examples of whole genomes emerging in the mid-1990s, it seemed natural to count the occurrences of all blocks of size 1, 2, and so on, up to any desired length, looking for statistical characterizations of coding regions, promoter regions, among others.

This article is not about cryptography. It is about a data structure and its variants, and the many surprising and useful features it carries. Among these is the fact that, to set up a statistical table of occurrences for all substrings (also called factors), of any length, of a text string of n characters, it only takes time and space linear in the length of the text string. While nobody would be so foolish as to solve the problem by first generating all exponentially many possible strings and then counting their occurrences one by one, a text string may still contain Θ(n²) distinct substrings, so that tabulating all of them in linear space, never mind linear time, already seems puzzling.

We dedicate this article to our friend and colleague, Alberto Apostolico (1948–2015), who passed away on July 20. He was a major figure in the development of algorithms on strings.

IMAGE BY PAPUCHALKA

key insights
- The suffix tree is the core data structure in string analysis.
- It has a rich history, with connections to compression, matching, automata, data structures and more.
- There are powerful techniques to build suffix trees and use them efficiently in many applications.

Over the years, such structures have held center stage in text searching, indexing, statistics, and compression as well as in the assembly, alignment, and comparison of biosequences. Their scope extends to areas as diverse as detecting plagiarism, finding surprising substrings in a text, testing the unique decipherability of a code, and more. Their impact on computer science and IT at large cannot be overstated. Text searching and bioinformatics would not be the same without them. In 2013, the Combinatorial Pattern Matching symposium celebrated the 40th anniversary of the appearance of Weiner’s invention of the suffix tree [41] with a special session entirely dedicated to that event.

History Bits and Pieces

At the dawn of “stringology,” Donald Knuth conjectured that the problem of finding the longest substring common to two long text sequences of total length n required Ω(n log n) time. An O(n log n)-time solution had been provided by Karp, Miller, and Rosenberg. [26] That construction was destined to play a role in parallel pattern matching, but Knuth’s conjecture was short lived: in 1973, Peter Weiner showed the problem admitted an elegant linear-time solution, [41] as long as the alphabet of the string was fixed. Such a solution was actually a byproduct of a construction he had originally set up for a different purpose, that is, identifying any substring of a text file without specifying all of them. In doing so, Weiner introduced the notion of a textual inverted index that would elicit refinements, analyses, and applications for 40 years and counting, a feature hardly shared by any other data structure.

Weiner’s original construction processed the text file from right to left. As each new character was read in, the structure, which he called a “bi-tree,” would be updated to accommodate longer and longer suffixes of the text file. Thus, this was an inherently offline construction, since the text had to be known in its entirety before the construction could begin. Alternatively, one could say the algorithm would build the structure for the reverse of the text online. About three years later, Ed McCreight provided a left-to-right algorithm and changed the name of the structure to “suffix tree,” a name that would stick. [32]

up with an entirely different and elegant construction!” In unpublished lecture notes of 1975, Vaughan Pratt displayed the duality of this structure and Weiner’s “repetition finder.” [37]

McCreight’s algorithm was still inherently offline, and it immediately triggered a search for an online version. Some partial attempts at an online algorithm were made, but such a variant had to wait almost two decades for Esko Ukkonen’s paper in 1995. [39] In all these linear-time constructions, linearity was based on the assumption of a finite alphabet; without that assumption, the constructions took Θ(n log n) time. In 1997, Martin Farach introduced an algorithm that abandoned the one-suffix-at-a-time approach prevalent until then; this algorithm gives a linear-time reduc-

a string of n characters has only O(n) states and edges. Initially coined a directed acyclic word graph (DAWG), it can even be further reduced if all the states are terminal states. [14] It then accepts all substrings of the string and is called the factor/substring automaton. There is a nice relation between the index data structures when the string has no end-marker and its suf-

Figure 1. The expanded suffix tree of the string x = abcabcaba.

Figure 2. Building an expanded suffix tree by insertion of consecutive suffixes (showing here the insertion of abcaba$). The insertion of suffix suf_i (i = 1, 2, …, n) consists of two phases. In the first phase, we search for suf_i in T_{i−1}. Note that the presence of $ guarantees that every suffix will end in a distinct leaf. Therefore, this search will end with failure sooner or later. At that point, we will have identified the longest prefix of suf_i that has a locus (that is, a terminal node) in T_{i−1}. Let head_i (abcab in the example) be this prefix and α the locus of head_i. We can write suf_i = head_i ∙ tail_i with tail_i (a$ in the example) nonempty. In the second phase, we need to add to T_{i−1} a path leaving node α and labeled tail_i. This achieves the transformation of T_{i−1} into T_i.

Let x be a string of n − 1 symbols over some alphabet Σ and $ an extra character not in Σ. The expanded suffix tree T_x associated with x is a digital search tree collecting all suffixes of x$. Specifically, T_x is defined as follows.

1. T_x has n leaves, labeled from 1 to n.
2. Each arc is labeled with a symbol of Σ ∪ {$}. For any i, 1 ≤ i ≤ n, the concatenation of the labels on the path from the root of T_x to leaf i is precisely the suffix suf_i = x_i x_{i+1} … x_{n−1}$.
3. For any two suffixes suf_i and suf_j of x$, if w_ij is the longest common prefix that suf_i and suf_j have in common, then the path in T_x relative to w_ij is the same for suf_i and suf_j.

An example of expanded suffix tree is given in Figure 1.

The tree can be interpreted as the state transition diagram of a deterministic finite automaton where all nodes and leaves are final states, the root is the initial state, and the labeled arcs, which are assumed to point downward, represent part of the state-transition function.
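To make the expanded suffix tree and its automaton reading concrete, here is a minimal Python sketch. It builds the tree of x$ by inserting consecutive suffixes in two phases (search for the longest prefix already present, then append the remainder as a fresh path) and then answers substring queries by following arcs from the root. The names `Node`, `build_expanded_suffix_tree`, `is_substring`, and `count_leaves` are hypothetical and chosen only for illustration. This is the naive construction, which can take quadratic time and space, and not any of the linear-time algorithms discussed in the article.

```python
class Node:
    """A node of the expanded suffix tree; each arc carries one symbol."""

    def __init__(self):
        self.children = {}   # symbol -> Node
        self.leaf = None     # suffix number i if this node is leaf i


def build_expanded_suffix_tree(x, end="$"):
    """Insert suf_1, ..., suf_n of x + end; return (root, heads).

    heads[i - 1] is the longest prefix of suf_i already present in the
    tree just before suf_i is inserted (an illustrative by-product).
    """
    assert end not in x, "the end marker must not occur in x"
    text = x + end
    root = Node()
    heads = []
    for i in range(len(text)):           # 0-based start of suf_{i+1}
        suffix = text[i:]
        # Phase 1: follow the existing path as far as possible.
        node, depth = root, 0
        while depth < len(suffix) and suffix[depth] in node.children:
            node = node.children[suffix[depth]]
            depth += 1
        heads.append(suffix[:depth])     # node is now the locus of the head
        # Phase 2: append a fresh path labeled with the remaining tail.
        for symbol in suffix[depth:]:
            child = Node()
            node.children[symbol] = child
            node = child
        node.leaf = i + 1                # leaves are numbered 1..n
    return root, heads


def is_substring(root, y):
    """Run the tree as an automaton: accept y iff its path from the root exists."""
    node = root
    for symbol in y:
        node = node.children.get(symbol)
        if node is None:
            return False                 # undefined transition: reject
    return True


def count_leaves(node):
    """Count the leaves below (and including) node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


root, heads = build_expanded_suffix_tree("abcabcaba")
print(heads[3])                    # head found when inserting abcaba$
print(count_leaves(root))          # one leaf per suffix of x$
print(is_substring(root, "cab"), is_substring(root, "cc"))
```

Because every node is treated as accepting, a query succeeds exactly when its symbols spell a path from the root; the sketch also accepts strings that end in the marker $, since the tree stores suffixes of x$ rather than of x.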