12/29/17

Key take home messages

• Suffix , Suffix Text Algorithms (6EAP) – Application examples

Full text indexing •

Jaak Vilo • Burrows Wheeler Transform (BWT) 2017 fall – Compact Suffix tree representation: BWT + succinct

Jaak Vilo MTAT.03.190 Text Algorithms 1

Problem E.g. Dictionary problem

• Given P and S – find all exact or approximate • Does P belong to a dictionary D={d1,…,dn} occurrences of P in S – Build a binary of D – B-Tree of D • You are allowed to preprocess S (and P, of – Hashing course) – Sorting + Binary search • Build a keyword trie: search in O(|P|) • Goal: to speed up the searches – Assuming alphabet has up to a constant size c – See Aho-Corasick algorithm, Trie construction

Sorted array and binary search Sorted array and binary search

1 13 1 13

global info global info head search head search he stop he stop header informal header informal hers hers index index happy his show happy his show

O( |P| log n )

1 12/29/17

Trie for D={ he, hers, his, she} S != of words

0 • S of length n h s

1 3 • How to index? i e h

2 6 4 • Index from every position of a text r s e

8 5 s 7 • Prefix of every possible suffix is important

9 O( |P| )

Suffix tree

• Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with Trie(babaab) their parents. • Definition ( suffix tree ). A suffix tree T for a string S (with n = |S|) is a a b babaab rooted, labeled tree with a leaf for each non-empty suffix of S. b a Furthermore, a suffix tree satisfies the following properties: abaab a a b • Each internal node, other than the root, has at least two children; baab b • Each edge leaving a particular node is labeled with a non-empty aab a a a of S of which the first symbol is unique among all first symbols of the edge ab b b a labels of the edges leaving this particular node; b • For any leaf in the tree, the concatenation of the edge labels on the path b from the root to this leaf exactly spells out a non-empty suffix of s. • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: and . Hardcover - 534 pages 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198.

Literature on suffix trees • http://en.wikipedia.org/wiki/Suffix_tree • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover - 534 pages 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: http://stackoverflow.com/questions/9452701/u 89--208) kkonens-suffix-tree-algorithm-in-plain-english • E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751 • Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3 • CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/ • Mark Nelson. Fast String Searching With Suffix Trees Dr. Dobb's Journal, August, 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm

2 12/29/17

Partly based on :

Suffix tree and suffix array techniques for pattern ttttttttttttttgagacggagtctcgctctg analysis in strings tcgcccaggctggagtgcagtggcggg Esko Ukkonen atctcggctcactgcaagctccgcctcc Univ Helsinki cgggttcacgccattctcctgcctcagcc Erice School 30 Oct 2005 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt tcccaagtagctgggactacaggcgcc cgccactacgcccggctaattttttgtattt High-throughput genome-scale sequence analysis and ttagtagagacggggtttcaccgttttagc mapping using compressed data structures Veli Mäkinen cgggatggtctcgatctcctgacctcgtg Department of Computer Science atccgcccgcctcggcctcccaaagtgc University of Helsinki tgggattacaggcgt

13 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt14

Solution: backtracking with suffix Analysis of a string of symbols tree • T = h a t t i v a t t i ’text’ • P = a t t ’pattern’

• Find the occurrences of P in T: h a t t i v a t t i

• Pattern synthesis: #(t) = 4 #(atti) = 2

#(t****t) = 2 ...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt15 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 16

Pattern finding & synthesis problems

• T = t1t2 … tn, P = p1 p 2 … pn , strings of symbols in finite alphabet 1. Suffix tree • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different 2. Suffix array patterns P can be found fast – static text, any given pattern P 3. Some applications • Pattern synthesis problem: Learn from T new patterns that occur surprisingly often 4. Finding motifs

• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, … 17 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt18

3 12/29/17

The suffix tree Tree(T) of T Suffix trie and suffix tree

• abaab suffix tree, Tree(T), is baab Trie(abaab) compacted trie that represents all the suffixes aab of string T ab a b b b • linear size: |Tree(T)| = O(|T|) a a a b a • can be constructed in linear time O(|T|) a b • has myriad virtues (A. Apostolico) b • is well-known: 366 000 Google hits

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt19 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt20

Suffix trie and suffix tree Trie(T) can be large

abaab • 2 baab |Trie(T)| = O(|T| ) aab • bad example: T = anbn ab b • Trie(T) can be seen as a DFA: language Trie(abaab) Tree(abaab) accepted = the suffixes of T

a b a • minimize the DFA => directed cyclic word baab b a a ab graph (’DAWG’) a baab b a a b b

21 22

Tree(T) is of linear size Tree(hattivatti) hattivatti vatti • attivatti only the internal branching nodes and the t i vatti leaves represented explicitly ttivatti ti i tivatti atti i • edges labeled by of T vatti ivatti hattivatti • v = node(α) if the path from root to v spells α ivatti vatti ti • vatti vatti one-to-one correspondence of leaves and atti vatti tti suffixes tti atti tivatti ti hattivatti ttivatti • |T| leaves, hence < |T| internal nodes attivatti • |Tree(T)| = O(|T| + size(edge labels)) i

23 24

4 12/29/17

Tree(hattivatti) Tree(hattivatti) hattivatti hattivatti vatti vatti attivatti i attivatti i t vatti 3,3 6 ttivatti ttivatti ti i 4,5 10 tivatti atti i tivatti 2,5 i hattivatti vatti vatti ivatti ivatti hattivatti ivatti vatti ti vatti 9 5 vatti vatti atti vatti 6,10 vatti tti atti 6,10 8 tti tti atti 7 tivatti 4 ti hattivatti ttivatti ti 1 3 attivatti 2 i substring labels of i edges represented as hattivatti pairs of pointers hattivatti 25 26

Tree(T) is full text index Find att from Tree(hattivatti) hattivatti Tree(T) vatti attivatti t vatti ttivatti P ti i tivatti atti i P occurs in T at vatti ivatti hattivatti locations 8, 31, … ivatti vatti ti vatti vatti atti vatti tti 31 8 tti atti 7 tivatti ti hattivatti ttivatti P occurs in T ó P is a prefix of some suffix of T attivatti ó Path for P exists in Tree(T) i 2 All occurrences of P in time O(|P| + #occ) 27 28

Linear time construction of Tree(T) On-line construction of Trie(T)

hattivatti • T = t1t2 … tn$ attivatti • P = t t … t i:th prefix of T ttivatti i 1 2 i tivatti • on-line idea: update Trie(Pi) to Trie(Pi+1)

ivatti Weiner McCreight • => very simple construction vatti (1973), (1976) atti ’algorithm tti of the ti year’

’on-line’ algorithm i

(Ukkonen 1992) 29 30

5 12/29/17

Trie(abaab) Trie(abaab)

Trie(a) Trie(ab) Trie(aba) Trie(abaa) abaa a a b a b baa a a b a b a b aa b b b b b a a εa a a a ε a a a a chain of links connects the end points of current suffixes

31 32

Trie(abaab) Trie(abaab)

Trie(abaa) Trie(abaa)

a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a

Add next symbol = b Add next symbol = b

From here on b-arc already exists

33 34

Trie(abaab) What happens in Trie(Pi) => Trie(Pi+1) ?

Before a a b a b a b a From here on the ai-arc i exists already => stop b b b a a a updating here a a a a Trie(abaab) After ai a b ai ai a b a i a ai a b a New suffix links a b New nodes b

35 36

6 12/29/17

What happens in Trie(Pi) => Trie(Pi+1) ? On-line procedure for suffix trie

• time: O(size of Trie(T)) 1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root • suffix links: 2. for i := 2 to n do begin slink(node(aα)) = node(α) 3. vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)

4. v := vi-1; v´ := 0

5. while node v has no outgoing arc for ti do begin

6. Create a new node v´´ and an arc son(v,ti) = v´´ 7. if v´ ≠ 0 then slink(v) := v´´ 8. v := slink(v); v´ := v´´ end

9. for the node v´´ such that v´´= son(v,ti) do if v´´ = v´ then slink(v’) := root else slink(v´) := v´´

37 38

Suffix trees on-line Implicit and real nodes

• ’compacted version’ of the on-line trie • Pair (v,α) is an implicit node in Tree(T) if v is a construction: simulate the construction on the node of Tree and α is a (proper) prefix of the linear size tree instead of the trie => time label of some arc from v. If α is the empty O(|T|) string then (v, α) is a ’real’ node (= v). • all trie nodes are conceptually still needed => • Let v = node(α´) in Tree(T). Then implicit node implicit and real nodes (v, α) represents node(α´α) of Trie(T)

39 40

Implicit node Suffix links and open arcs

root

… α … α´ aα

v slink(v) v … α … (v, α) label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T

w

41 42

7 12/29/17

Big picture On-line procedure for suffix tree

Input: string T = t1t2 … tn$ suffix link path traversed: total work O(n) Output: Tree(T) Notation: son(v,α) = w iff there is an arc from v to w with label α … son(v,ε) = v

… Function Canonize(v, α): … … … while son(v, α´) ≠ 0 where α = α´ α´´, | α´| > 0 do v := son(v, α´); α := α´´ return (v, α) new arcs and nodes created:

total work O(size(Tree(T)) 43 44

Suffix-tree on-line: main procedure

Create Tree(t1); slink(root) := root (v, α) := (root, ε) /* (v, α) is the start node */ for i := 2 to n+1 do v´ := 0 http://stackoverflow.com/questions/9452701/u

while there is no arc from v with label prefix αti do kkonens-suffix-tree-algorithm-in-plain-english if α ≠ ε then /* divide the arc w = son(v, αη) into two */

son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w else

son(v,ti) := v´´´; v´´ := v if v´ ≠ 0 then slink(v´) := v´´ v´ := v´´; v := slink(v); (v, α) := Canonize(v, α) if v´ ≠ 0 then slink(v´) := v

(v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */ 45

The actual time and space abc

• |Tree(T)| is about 20|T| in practice • brute-force construction is O(|T|log|T|) for random strings as the average depth of internal nodes is O(log|T|) • difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz) • truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003) • alphabet independent linear time (Farach 1997)

47

8 12/29/17

abcabxabcd

Applications of Suffix Trees

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover - 534 pages 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. - book

• APL1: Exact String Matching Search for P from text S. Solution 1: build STree(S) - one achieves the same O(n+m) as Knuth- Morris-Pratt, for example! • Search from the suffix tree is O(|P|) • APL2: Exact set matching Search for a set of patterns P

9 12/29/17

Back to backtracking Applications of Suffix Trees

ACA, 1 mismatch • APL3: substring problem for a database of patterns A T Given a set of strings S=S1, ... , Sn --- a database Find C all Si that have P as a substring A T C A AT T A • contains all suffixes of all Si T C C C T T T • Query in time O(|P|), and can identify the LONGEST 4 2 1 5 6 3 common prefix of P in all Si

Same idea can be used to many other forms of approximate search, like Smith-Waterman, position-restricted scoring matrices, search, etc.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 55

Applications of Suffix Trees Simple analysis task: LCSS

• APL4: Longest common substring of two strings • Let LCSSA(A,B) denote the longest common • Find the longest common substring of S and T. substring two sequences A and B. E.g.: • Overall there are potentially O( n2 ) such substrings, if – LCSS(AGATCTATCT,CGCCTCTATG)=TCTAT. n is the length of a shorter of S and T • A good solution is to build suffix tree for the • once (1970) conjectured that linear- shorter sequence and make a descending time algorithm is impossible. suffix walk with the other sequence. • Solution: construct the STree(S+T) and find the node deepest in the tree that has suffixes from both S and T in subtree leaves. • Ex: S= superiorcalifornialives T= sealiver have both a substring alive. ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 58

Suffix link Descending suffix walk

suffix tree of A Read B left-to-right, always going down in the tree when possible. X If the next symbol of B does not match any edge label on current position, take aX suffix link, and try again. (Suffix link in the root suffix link to itself emits a symbol). The node v encountered with largest string depth v is the solution.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 59 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 60

10 12/29/17

Another common tool: Generalized Applications of Suffix Trees suffix tree

• APL5: Recognizing DNA contamination Related A node info: C subtree size 47813871 to DNA sequencing, search for longest strings sequence count 87 (longer than threshold) that are present in the C DB of sequences of other genomes. • APL6: Common substrings of more than two strings Generalization of APL4, can be done in linear (in total length of all strings) time ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 62

Generalized suffix tree application Case study continued

suffix tree of genome

motif? TAC...... T A node info: subtree size 4398 C 5 blue blue sequences 12/15 1 red C red sequences 2/62

...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...... #....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#...... #... genome regions with ChIP-seq matches

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 63 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 64

Applications of Suffix Trees Applications of Suffix Trees

• APL7: Building a directed graph for exact • APL8: A reverse role for suffix trees, and major matching: Suffix graph - directed acyclic word space reduction Index the pattern, not tree... graph (DAWG), a smallest finite state • Matching statistics. automaton recognizing all suffixes of a string • APL10: All-pairs suffix-prefix matching For all S. This automaton can recognize membership, pairs S , S , find the longest matching suffix- but not tell which suffix was matched. i j prefix pair. Used in shortest common • Construction: merge isomorfic subtrees. superstring generation (e.g. DNA sequence • Isomorfic in Suffix Tree when exists suffix link assembly), EST alignmentm etc. path, and subtrees have equal nr. of leaves.

11 12/29/17

Applications of Suffix Trees Applications of Suffix Trees

• APL11: Finding all maximal repetitive • APL14: Suffix trees in genome-scale projects structures in linear time • APL15: A Boyer-Moore approach to exact set • APL12: Circular string linearization e.g. circular matching chemical molecules in the database, one • APL16: Ziv-Lempel wants to lienarize them in a canonical way... • APL17: Minimum length encoding of DNA • APL13: Suffix arrays - more space reduction will touch that separately

Applications of Suffix Trees Applications of Suffix Trees

• Additional applications Mostly exercises... • APL: Exact matching with wild cards • Extra feature: CONSTANT time retrieval (LCA) Andmestruktuur mis võimaldab leida konstantse ajaga alumist ühist • APL: The k-mismatch problem vanemat (see vastab pikimale ühisele prefixile!) on võimalik koostada lineaarse ajaga. • Approximate and repeats • APL: Longest common extension: a bridge to inexact matching • APL: Finding all maximal palindromes in linear time • Faster methods for tandem repeats reads from central position the same to left and right. E.g.: kirik, saippuakivikauppias. • A linear-time solution to the multiple common • Build the suffix tree of S and inverted S (aabcbad => aabcbad#dabcbaa ) substring problem and using the LCA one can ask for any position pair (i, 2i-1), the longest common prefix in constant time. • And many-many more ... • The whole problem can be solved in O(n).

Properties of suffix tree... in Properties of suffix tree practice • Suffix tree has n leaves and at most n-1 • Huge overhead due to pointer structure: internal nodes, where n is the total length of – Standard implementation of suffix tree for human all sequences indexed. genome requires over 200 GB memory! • Each node requires constant number of – A careful implementation (using log n -bit fields integers (pointers to first child, sibling, parent, for each value and array layout for the tree) still text range of incoming edge, statistics requires over 40 GB. counters, etc.). – itself takes less than 1 GB using 2- bits per bp. • Can be constructed in linear time.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 71 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 72

12 12/29/17

Suffixes - sorted hattivatti ε attivatti atti 1. Suffix tree ttivatti attivatti tivatti hattivatti 2. Suffix array ivatti i vatti ivatti 3. Some applications atti ti tti tivatti ti tti 4. Finding motifs i ttivatti ε vatti

• Sort all suffixes. Allows to perform binary search! 73 74

Suffix array: example Suffix array construction: sort! 1 hattivatti ε 11 hattivatti 1 11 2 attivatti atti 7 attivatti 2 7 3 ttivatti attivatti 2 ttivatti 3 2 4 tivatti hattivatti 1 tivatti 4 1 5 ivatti i 10 ivatti 5 10 6 vatti ivatti 5 vatti 6 5 7 atti ti 9 atti 7 9 8 tti tivatti 4 tti 8 4 9 ti tti 8 ti 9 8 10 i ttivatti 3 i 10 3 11 ε vatti 6 ε 11 6

• suffix array = lexicographic order of the suffixes • suffix array = lexicographic order of the suffixes 75 76

Suffix array Reducing space: suffix array

• suffix array SA(T) = an array giving the =[2,2] A T =[3,3] lexicographic order of the suffixes of T C

• space requirement: 5|T| =[5,6] A C T=[3,6] T A A A T =[6,6] =[4,6] • practitioners like suffix arrays (simplicity, T C C =[2,6] C T T T space efficiency) suffix 4 2 1 5 6 3 array • theoreticians like suffix trees (explicit structure) C A T A C T 1 2 3 4 5 6

77 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 78

13 12/29/17

Suffix array Pattern search from suffix array hattivatti ε 11 • Many algorithms on suffix tree can be attivatti atti 7 att simulated using suffix array... ttivatti attivatti 2 binary search tivatti hattivatti 1 – ... and couple of additional arrays... ivatti i 10 – ... forming so-called enhanced suffix array... vatti ivatti 5 – ... leading to the similar space requirement as atti ti 9 careful implementation of suffix tree tti tivatti 4 ti tti 8 • Not a satisfactory solution to the space issue. i ttivatti 3 ε vatti 6

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 79 80

What we learn today? Recent suffix array constructions

• We learn that it is possible to replace suffix • Manber&Myers (1990): O(|T|log|T|) trees with compressed suffix trees that take • linear time via suffix tree 8.8 GB for the human genome. • January / June 2003: direct linear time • We learn that backtracking can be done using construction of suffix array compressed suffix arrays requiring only 2.1 GB - Kim, Sim, Park, Park (CPM03) for the human genome. - Kärkkäinen & Sanders (ICALP03) • We learn that discovering interesting motif - Ko & Aluru (CPM03) seeds from the human genome takes 40 hours and requires 9.3 GB space.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 81 82

Kärkkäinen-Sanders algorithm Notation

1.Construct the suffix array of the suffixes • string T = T[0,n) = t0t1 … tn-1 starting at positions i mod 3 ≠ 0. This is • suffix S = T[i,0) = t t … t done by reduction to the suffix array i i i+1 n-1 construction of a string of two thirds the • for C \subset [0,n]: SC = {Si | i in C} length, which is solved recursively. 2.Construct the suffix array of the remaining suffixes using the result of the first step. • suffix array SA[0,n] of T is a permutation of 3.Merge the two suffix arrays into one. [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]

83 84

14 12/29/17

Running example Step 0: Construct a sample

0 1 2 3 4 5 6 7 8 9 10 11 • for k = 0,1,2 • T[0,n) = y a b b a d a b b a d o 0 0 … Bk = {i є [0,n] | i mod 3 = k} • C = B1 U B2 sample positions

• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0) • SC sample suffixes

• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

85 86

Step 1: Sort sample suffixes Step 1 (cont.) • for k = 1,2, construct • once the sample suffixes are sorted, assign a rank to Rk = [tktk+1tk+2] [tk+3tk+4tk+5]… [tmaxBktmaxBk+1tmaxBk+2] each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(S ) = 0 R = R1 ^ R2 concatenation of R1 and R2 n+2 • Example: Suffixes of R correspond to SC: suffix [titi+1ti+2]… corresponds to Si ; R = [abb][ada][bba][do0][bba][dab][bad][o00] correspondence is order preserving. R´ = (1,2,4,6,4,5,3,7)

Sort the suffixes of R: radix sort the characters and rename with SAR´ = (8,0,1,6,4,2,5,3,7) ranks to obtain R´. If all characters different, their order directly rank(S ) - 1 4 - 2 6 - 5 3 – 7 8 – 0 0 gives the order of suffixes. Otherwise, sort the suffixes of R´ using i Kärkkäinen-Sanders. Note: |R´| = 2n/3.

87 88

Step 2: Sort nonsample suffixes Step 3: Merge • merge the two sorted sets of suffixes using a standard • for each non-sample Si є SB0 (note that comparison-based merging:

rank(Si+1) is always defined for i є B0): • to compare Si є SC with Sj є SB0, distinguish two cases: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1)) • i є B1: S ≤ S ↔ (t ,rank(S )) ≤ (t ,rank(S )) • radix sort the pairs (ti,rank(Si+1)). i j i i+1 j j+1 • i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2))

• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < • note that the ranks are defined in all cases! (a,5) < (a,7) < (b,2) < (y,1) • S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)

89 90

15 12/29/17

Running time O(n) Implementation

• excluding the recursive call, everything can be • about 50 lines of C++ done in linear time • code available e.g. via Juha Kärkkäinen’s home • the recursion is on a string of length 2n/3 page • thus the time is given by recurrence T(n) = T(2n/3) + O(n) • hence T(n) = O(n)

91 92

LCP table

• Longest Common Prefix of successive • Oxford English Disctionary http://www.oed.com/ • Example - Word of the Day , Fourth elements of suffix array: http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html http://www.oed.com/cgi/display/wotd • LCP[i] = length of the longest common prefix • PAT index - by Gaston Gonnet (ta on samuti Maple tarkvara üks loojatest ning hiljem molekulaarbioloogia tarkvarapaketi väljatöötajaid) of suffixes SSA[i] and SSA[i+1] • PAT index is essentially a suffix array. To save space, indexed only from first • build inverse array SA-1 from SA in linear time character of every word • XML-tagging (or SGML, at that time!) also indexed -1 • then LCP table from SA in linear time (Kasai • To mark certain fields of XML, the bit vectors were used. et al, CPM2001) • Main concern - improve the speed of search on the CD - minimize random accesses. • For slow medium even 15-20 accesses is too slow... • G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for 93 the New OED, University of Waterloo, 1991.

Suffix tree vs suffix array

1. Suffix tree • suffix tree ó suffix array + LCP table 2. Suffix array 3. Some applications 4. Finding motifs

95 96

16 12/29/17

Substring motifs of string T Counting the substring motifs

• string T = t1 … tn in alphabet A. • internal nodes of Tree(T) ↔ repeating • Problem: what are the frequently occurring substrings of T (ungapped) substrings of T? Longest substring • number of leaves of the subtree of a node for that occurs at least q times? string P = number of occurrences of P in T • Thm: Suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n2) substrings!)

97 98

Substring motifs of hattivatti Finding repeats in DNA

vatti • t i vatti human chromosome 3 4 2 ti i • the first 48 999 930 bases atti i hattivatti 2 2 vatti • 31 min cpu time (8 processors, 4 GB) ivatti 2 ti vatti vatti 9 vatti tti • Human genome: 3x10 bases atti tivatti • Tree(HumanGenome) feasible hattivatti ttivatti attivatti

Counts for the O(n) maximal motifs shown

99 100

Longest repeat? Ten occurrences?

Occurrences at: 28395980, 28401554r Length: 2559 ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaac caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt atgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcata gtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatac agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg gtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatca cccgcctcggcctcccaaagtgctgggattacaggcgt ccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgcca ttctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttga gaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtag gttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggctttt gttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgta agtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaag Length: 277 ggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcaga tagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatag tttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaatt ctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgt Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, gtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattct 47398125 ctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagacttt gctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcc In the reversed complement at: 17858493, 41463059, 42431718, taattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgcca 42580925 gttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttatt gagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttat tgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattg aggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagtta gg 101 102

17 12/29/17

Using suffix trees: approximate Using suffix trees: plagiarism matching • find longest common substring of strings X and Y • : insertions, deletions, changes • build Tree(X$Y) and find the deepest node which has a leaf pointing to X and another • STOCKHOLM vs TUKHOLMA pointing to Y

103 104

Dynamic programming String distance/similarity functions di,j = min(if ai=bj then di-1,j-1 else ¥, di-1,j + 1, di,j-1 + 1) STOCKHOLM vs TUKHOLMA = distance between i-prefix of A and j-prefix of B (substitution excluded) STOCKHOLM_ B _TU_ KHOLMA mxn table d bj => 2 deletions, 1 insertion, 1 change

di-1,j-1 di-1,j A +1 a d d i i,j-1 +1 i,j

dm,n 105 106

di,j = min(if ai=bj then di-1,j-1 else ¥, Search problem di-1,j + 1, di,j-1 + 1) • find approximate occurrences of pattern P in A\B s t o c k h o l m 0 1 2 3 4 5 6 7 8 9 text T: substrings P’ of T such that d(P,P’) small t 1 2 1 2 3 4 5 6 7 8 • dyn progr with small modification: O(mn) u 2 3 2 3 4 5 6 7 8 9 • k 3 4 3 4 5 4 5 6 7 8 lots of (practical) improvement tricks h 4 5 4 5 6 5 4 5 6 7 o 5 6 5 4 5 6 5 4 5 6 l 6 7 6 5 6 7 6 5 4 5 T P’ m 7 8 7 6 7 8 7 6 5 4 a 8 9 8 7 8 9 8 7 6 5 P

optimal alignment by trace-back 107 dID(A,B) 108

18 12/29/17

Index for approximate searching? Burrows-Wheeler Transformation

• dynamic programming: P x Tree(T) with backtracking • BWT for text compression and indexing

Tree(T)

P

109

Burrows-Wheeler

• See FAQ http://www.faqs.org/faqs/compression-faq/part2/section-9.html • The method described in the original paper is really a composite of three different algorithms: – the block sorting main engine (a lossless, very slightly expansive preprocessor), – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and – a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression. • Of these three methods only the first two are discussed here as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data. • Please, read the article by Mark Nelson: • Data Compression with the Burrows-Wheeler Transform Mark Nelson, Dr. Dobb's Journal September, 1996. http://marknelson.us/1996/09/01/bwt/

BWT

19 12/29/17

0 1 6 1 4 5 * => 5 2 4 0 1 6 3 2 0 D R D O B B S 1 2 6 3 3 4 5 4 0 2 5 D R D 3 6

20 12/29/17

Example

CODE: t: hat acts like this:<13><10><1 t: hat buffer to the constructor • Decode: errktreteoe.e t: hat corrupted the , or wo W: hat goes up must come down<13 t: hat happens, it isn't likely w: hat if you want to dynamicall • t: hat indicates an error.<13><1 Hint: . Is the last character, t: hat it removes arguments from alphabetically first… t: hat looks like this:<13><10>< t: hat looks something like this t: hat looks something like this t: hat once I detect the mangled

21