Forty Years of Suffix Trees

King’s Research Portal DOI: 10.1145/2810036 Document Version Peer reviewed version Link to publication record in King's Research Portal Citation for published version (APA): Apostolico, A., Crochemore, M., Farach-Colton, M., Galil, Z., & Muthukrishnan, S. (2016). 40 Years of suffix trees. COMMUNICATIONS OF THE ACM, 59(4), 66-73. https://doi.org/10.1145/2810036 Citing this paper Please note that where the full-text provided on King's Research Portal is the Author Accepted Manuscript or Post-Print version this may differ from the final Published version. If citing, it is advised that you check and use the publisher's definitive version for pagination, volume/issue, and date of publication details. And where the final published version is provided on the Research Portal, if citing you are again advised to check the publisher's website for any subsequent corrections. General rights Copyright and moral rights for the publications made accessible in the Research Portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognize and abide by the legal requirements associated with these rights. •Users may download and print one copy of any publication from the Research Portal for the purpose of private study or research. •You may not further distribute the material or use it for any profit-making activity or commercial gain •You may freely distribute the URL identifying the publication in the Research Portal Take down policy If you believe that this document breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim. Download date: 29. Sep. 2021 Forty Years of Suffix Trees Alberto Apostolico Maxime Crochemore College of Computing King’s College London Georgia Institute of London WC2R 2LS, UK Technology and Université Paris-Est, 801 Atlantic Drive France Atlanta, GA 30332-0280, USA [email protected] [email protected] Martin Farach-Colton Zvi Galil S. Muthukrishnan Department of Computer College of Computing Department of Computer Science Georgia Institute of Science Rutgers University Technology Rutgers University Piscataway, NJ 08854, USA 801 Atlantic Drive Piscataway, NJ 08854, USA [email protected] Atlanta, GA 30332-0280, USA [email protected] [email protected] ABSTRACT Both before and after 1843, the natural impulse when This paper reviews the first 40 years in the life of suffix faced with some mysterious message has been to count fre- trees, their many incarnations, and their applications. The quencies of individual tokens or subassemblies in search of a paper is non-technical but assumes some familiarity with the clue. Perhaps one of the most intense and fascinating sub- structures and constructions discussed. It is not meant to be jects for this kind of scrutiny have been bio-sequences. As exhaustive. It is meant to be a tribute to a ubiquitous tool soon as some such sequences became available, statistical of string matching | the suffix tree and its variants | and analysts tried to link characters or blocks of characters to one of the most persistent subjects of study in the theory of relevant biological functions. With the early examples of algorithms. whole genomes emerging in the mid 1990's, it seemed nat- Keywords: pattern matching, string searching, bi-tree, suf- ural to count the occurrences of all blocks of size 1, 2, etc., fix tree, dawg, suffix automaton, factor automaton, suffix up to any desired length, looking for statistical characteri- array, FM-index, wavelet tree. zations of coding regions, promoter regions, etc. This review is not about cryptography. It is about a data structure and its variants, and the many surprising and use- 1. PROLOG ful features it carries. Among these is the fact that, to set When William Legrand finally decrypted the string: up a statistical table of occurrences for all substrings (also called factors), of any length, of a text string of n charac- 53zzy305))6*,48264z.)4z);806",48y8P60))85;1z ters, it only takes time and space linear in the length of the (;:z*8y83(88)5*y,46(;88*96*?;8)*z(;485);5*y2:*z text string. While nobody would be so foolish as to solve (;4956*2(5*N4)8P8*;4069285);)6~ z8)4zz;1(z9;48081;8: the problem by first generating all exponentially many pos- 8z1;4885;4)485y528806*81(ddag9;48;(88;4(z?34; sible strings and then counting their occurrences one by one, 48)4z;161;:188;z?; a text string may still contain Θ(n2) distinct substrings, so it did not seem to make much more sense than it did before. that tabulating all of them in linear space, never mind linear The decoded message read: \A good glass in the bishop's time, already seems puzzling. hostel in the devil's seat forty-one degrees and thirteen min- Over the years, such structures have held center stage utes northeast and by north main branch seventh limb east in text searching, indexing, statistics, and compression as side shoot from the left eye of the death's-head a bee line well as in the assembly, alignment and comparison of biose- from the tree through the shot fifty feet out." But at least it quences. Their range of scope extends to areas as diverse did sound more like natural language, and eventually guided as detecting plagiarism, finding surprising substrings in a the main character of Edgar Allan Poe's\The Gold Bug"[36] text, testing the unique decipherability of a code, and more. to discover the treasure he had been after. Legrand solved a Their impact on Computer Science and IT at large cannot substitution cipher using symbol frequencies. He first looked be overstated. Text searching and Bioinformatics would not for the most frequent symbol and changed it into the most be the same without them. In 2013, the Combinatorial Pat- frequent letter of English, then similarly inferred the most tern Matching symposium celebrated the 40th anniversary frequent word, then punctuation marks, and so on. of the appearance of Weiner's paper [41] with a special ses- Permission to make digital or hard copies of all or part of this work for sion entirely dedicated to that event. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to 2. HISTORY BITS AND PIECES republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. At the dawn of \stringology," Don Knuth conjectured that . the problem of finding the longest substring common to two long text sequences of total length n required Ω(n log n) time. An O(n log n)-time had been provided by Karp, Miller and Rosenberg [26]. That construction was destined to play c a role in parallel pattern matching, but Knuth's conjecture a a was short lived: in 1973, Peter Weiner showed that the prob- $ b b lem admitted an elegant linear-time solution [41], as long as b $ 10 a c the alphabet of the string was fixed. Such a solution was c a a actually a byproduct of a construction he had originally set $ 9 c $ a up for a different purpose, i.e., identifying any substring of a a $ 6 b text file without specifying all of them. In doing so, Weiner b 7 a 8 b a introduced the notion of a textual inverted index that would c a elicit refinements, analyses and applications for forty years a $ and counting, a feature hardly shared by any other data a $ c a 3 structure. b 4 $ Weiner's original construction processed the text file from a b 5 right to left. As each new character was read in, the struc- $ a ture, which he called a\bi-tree", would be updated to accom- 1 $ modate longer and longer suffixes of the text file. Thus this 2 was an inherently off-line construction, since the text had to be known in its entirety before the construction could begin. Alternatively, one could say that the algorithm would build the structure for the reverse of the text on-line. About three years later, Ed McCreight provided a left-to-right algorithm and changed the name of the structure to \suffix tree," a name that would stick [32]. Figure 1: The expanded suffix tree of the string x = Let x be a string of n − 1 symbols over some alphabet abcabcaba Σ and $ an extra character not in Σ. The expanded suffix tree Tx associated with x is a digital search tree collecting all suffixes of x$. Specifically, T is defined as follows. x this is done, it becomes possible to aim for an expectedly 1. T has n leaves, labeled from 1 to n. non-trivial O(n) time construction. x At the CPM Conference of 2013, McCreight revealed that 2. Each arc is labeled with a symbol of Σ [ f$g. For any his O(n) time construction was not born as an alternative i, 1 ≤ i ≤ n, the concatenation of the labels on the to Weiner's: he had developed it in an effort to understand path from the root of Tx to leaf i is precisely the suffix Weiner's paper, but when he showed it to Weiner asking him sufi = xixi+1:::xn−1$. to confirm that he had understood that paper the answer was "No, but you have come up with an entirely different 3.

Load more