
Introduction Basic Definitions Dictionaries Suffix Example Overview

An introduction to suffix trees and indexing

Tomáš Flouri, Solon P. Pissis

Heidelberg Institute for Theoretical Studies

December 3, 2012

1 Introduction
2 Basic Definitions: Alphabet and strings
3 Dictionaries: Patricia tree
4 Suffix tree: Suffix trie, Suffix tree, Ukkonen’s algorithm
5 Example
6 Overview


Introduction Introduction

Two main problem areas in text retrieval

1 String matching
2 Indexing and querying

Exact and approximate cases!

Introduction Exact string matching

Many efficient algorithms exist

Knuth-Morris-Pratt algorithm
Boyer-Moore, Boyer-Moore-Horspool, Turbo-Boyer-Moore, etc.
Aho-Corasick
...

Introduction Indexing - 1

Problem Given a text T, we need to construct an efficient data structure D which will serve as an index of T, so that we can efficiently query text T.

What do we expect from an efficient indexing data structure?

Introduction Indexing - 2

Given a query pattern P, we want to find all occurrences of P in preprocessed text T using the indexing data structure D

The data structure D is efficient if
It can be built in time linear in the size of T (O(|T|))
It occupies space linear in the size of T (O(|T|))
It can answer a query whether P exists in T in time linear in the size of P (O(|P|))
It can report all occurrences of P in T in time O(|P| + occ), where occ is the number of occurrences

Introduction Indexing - 2

Some efficient indexing data structures include

Suffix automata (DAWG) and variations such as CDAWG
Suffix trees
Position heaps
Suffix arrays

In this lecture we will concentrate only on suffix trees


Graph theory Graph, Cycle, Path

Graph A graph is a pair G = (V, E) of sets such that E ⊆ V × V.

Path A path of length n in a graph G = (V, E) is a sequence v_0, v_1, ..., v_n ∈ V such that (v_0, v_1), (v_1, v_2), ..., (v_{n−1}, v_n) ∈ E.

Cycle A path v_0, v_1, ..., v_n, v_0, where n ≥ 2, is called a cycle.

[Figure: example graph on the vertices 1 to 6]

Graph theory Rooted tree, subtree, tree height, node height

Tree A rooted tree is a connected acyclic graph T = (V, E) with a special vertex v ∈ V called the root. Nodes with degree 1 are called leaves.

Alphabet and strings Alphabet and strings

Definition (Alphabet) An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String) A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ∗.

Definition (Length of string) The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

Alphabet and strings Alphabet and strings

We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x. We also call index i, for all 1 ≤ i ≤ |x|, a position in x when x ≠ ε. It follows that the ith letter of x is the letter at position i in x, and that x = x[1 .. |x|].

Definition (Factor of string) A string x is a factor (substring) of a string y if there exist two strings u and v, such that y = uxv.

We denote the factor (substring) of x starting at position i and ending at position j as x[i .. j].
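The notation above is 1-based and inclusive, while most programming languages index from 0. The following is a minimal Python sketch of how x[i .. j] and the factor relation could be expressed; the function names factor, occurrences and is_factor are illustrative choices and do not come from the slides.

```python
def factor(x: str, i: int, j: int) -> str:
    """Return x[i .. j] for 1-based, inclusive positions i <= j."""
    if not 1 <= i <= j <= len(x):
        raise ValueError("need 1 <= i <= j <= |x|")
    return x[i - 1:j]  # Python slices are 0-based and exclude the right end


def occurrences(x: str, y: str) -> list[int]:
    """1-based starting positions at which x occurs as a factor of y."""
    return [i + 1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i)]


def is_factor(x: str, y: str) -> bool:
    """True if y = u x v for some (possibly empty) strings u and v."""
    return len(occurrences(x, y)) > 0
```

For example, factor("banana", 2, 4) corresponds to banana[2 .. 4] in the notation of the slides.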


Trie Trie

Retrieval

Construct a dictionary for the set of words {amy, andy, ann, rob, roger, ben, betty}

[Figure: trie for the word set, with nodes labeled A to T and one letter per edge; a second version of the figure terminates every word with $]

Patricia tree Patricia tree

1 Construct a trie
2 Remove nodes with out-degree 1 and concatenate the labels of the corresponding edges to one edge

[Figure: the trie for {amy, andy, ann, rob, roger, ben, betty} before and after compaction; the compacted tree carries edge labels such as ro, be, my, dy, tty and ger]
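The two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the trie is stored as nested dictionaries, every word is terminated by $ (as in the second trie figure), and the names build_trie and compact are hypothetical.

```python
def build_trie(words):
    """Step 1: nested-dict trie, one letter per edge.

    Each word is terminated by '$' (an assumption of this sketch) so that
    every word ends at a leaf even if it is a prefix of another word.
    """
    root = {}
    for word in words:
        node = root
        for ch in word + "$":
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Step 2: remove nodes with out-degree 1 by merging their edge labels."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                  # child has out-degree 1
            (nxt, below), = child.items()
            label, child = label + nxt, below   # concatenate the two labels
        out[label] = compact(child)
    return out


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
patricia = compact(build_trie(words))
```

With the word set of the slide, the root of patricia has the three outgoing edges a, ro and be, which agrees with the compacted figure apart from the added terminator.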


Suffix trie Suffix trie

Given some text, e.g. t = banana, construct the suffix trie.
1 Generate the set Suff(t)
2 Construct a trie from Suff(t)
The resulting data structure is called a suffix trie.

Example Given the text t = banana$, the set Suff(t) is

Suff(t) = {banana$, anana$, nana$, ana$, na$, a$}
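A minimal Python sketch of both steps is given below. It assumes that the text already ends with the terminator $ and, like the listing above, it leaves out the suffix that consists of $ alone; the names suffixes, build_suffix_trie and contains are illustrative only.

```python
def suffixes(t):
    """Suff(t) as listed on the slide: every suffix except the lone '$'."""
    return [t[i:] for i in range(len(t) - 1)]


def build_suffix_trie(t):
    """Insert every suffix of t into a nested-dict trie (one letter per edge)."""
    root = {}
    for s in suffixes(t):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def contains(trie, p):
    """A pattern occurs in the text iff it can be spelled out from the root."""
    node = trie
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana$")
```

The query contains(trie, p) inspects at most |p| edges, but the trie itself can need space quadratic in the text length, which is the motivation for compacting it into a suffix tree.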

Suffix trie Suffix trie - Example

Given the text t = banana$, construct the suffix trie.

[Figure: suffix trie of banana$, one letter per edge, with leaves numbered 1 to 6 by the starting position of the suffix]

Suffix tree Suffix tree

Definition A suffix tree is a patricia tree of the suffix trie.

Construction
1 Construct a suffix trie of text x
2 Eliminate all nodes with out-degree 1 and concatenate the labels in the corresponding edges to one edge.

Suffix tree Suffix tree - Example

[Figure: the suffix trie of banana$ and the suffix tree obtained from it by compaction; the suffix tree has the edge labels a, na, banana$, $ and na$, and leaves numbered 1 to 6]

Suffix tree Size of suffix tree

Theorem A suffix tree consists of at most 2n − 1 nodes (or 2n if empty suffix $ is taken into account).

Proof (by induction) Since compaction removes every node with a single child, the number of internal nodes for a given number of leaves is largest when the tree is binary. It therefore suffices to count the nodes of a binary tree.
Base case For 2 leaves we have 1 internal node.
Inductive step Assume that any binary tree with m < N leaves consists of exactly m − 1 internal nodes. We must prove that a binary tree with N leaves has exactly N − 1 internal nodes. A binary tree with N leaves is made up of:
A root node.
A left binary tree with k leaves.
A right binary tree with N − k leaves.
According to the induction assumption
The left binary tree with k leaves consists of k − 1 internal nodes.
The right binary tree with N − k leaves consists of N − k − 1 internal nodes.
Therefore, the total number of internal nodes in a binary tree with N leaves is (k − 1) + (N − k − 1) + 1 = N − 1 and thus, the total number of nodes is 2N − 1. □
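The counting argument can be checked on small inputs with a naive construction (suffix trie followed by compaction). The sketch below is an illustration, not the construction used later in the lecture: it inserts all suffixes including the lone $, so that the root also has at least two children, and it checks the bound in the form nodes ≤ 2 · leaves − 1. The names suffix_tree, count_nodes and count_leaves are hypothetical.

```python
def suffix_tree(text):
    """Naive suffix tree as nested dicts keyed by edge label.

    Assumes text ends with a unique terminator such as '$'. All suffixes,
    including the lone terminator, are inserted.
    """
    trie = {}
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    def compact(node):
        out = {}
        for label, child in node.items():
            while len(child) == 1:              # merge chains of single children
                (nxt, below), = child.items()
                label, child = label + nxt, below
            out[label] = compact(child)
        return out

    return compact(trie)


def count_nodes(tree):
    return 1 + sum(count_nodes(child) for child in tree.values())


def count_leaves(tree):
    if not tree:
        return 1
    return sum(count_leaves(child) for child in tree.values())


tree = suffix_tree("banana$")
nodes, leaves = count_nodes(tree), count_leaves(tree)
```

For banana$ this gives 7 leaves and 11 nodes, which is within the bound 2 · 7 − 1 = 13; for a text such as aaaa$ the tree is binary and the bound is met with equality.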

Suffix tree Suffix tree construction algorithms

Weiner’s algorithm (1973)
  Introduced as position tree
  Construction in linear time (for constant size alphabets)
  Characterized as algorithm of the year
McCreight’s algorithm (1976)
  Improved space requirements over Weiner’s method
  Construction in linear time (for constant size alphabets)
Ukkonen’s algorithm (1995)
  Same time and space requirements as McCreight’s
  Easier to understand
  On-line
Farach’s algorithm (1997)
  Linear time construction algorithm for any type of alphabet
  Hard to implement
  The basis for new algorithms, e.g. position heaps and suffix arrays in linear time

Ukkonen’s algorithm Implicit suffix tree

Definition An implicit suffix tree for string x is a tree obtained from the suffix tree of x by
1 Removing $ from all edge labels
2 Removing any edge that has no label
3 Removing any node with only one child

[Figure: the three steps applied to the suffix tree of banana$; the resulting implicit suffix tree has the edge labels banana, anana and nana and the leaves 1, 2 and 3]

Ukkonen’s algorithm Implicit suffix tree

The implicit suffix tree of a string is what results by applying Ukkonen’s algorithm to the string without an added end marker $.

All suffixes are included, but not necessarily as labels of complete paths leading to leaves.

By appending a unique symbol at the end of the string (in our case the $), the implicit suffix tree is essentially the same as the (true) suffix tree (only without $).

Ukkonen’s algorithm String paths of implicit suffix trees

Given a string y[1 .. n], an implicit suffix tree I_i contains each suffix y[1 .. i], y[2 .. i], ..., y[i] of the prefix y[1 .. i] as a label of some path (possibly ending at the middle of an edge)

That is, a string path is

a string that can be matched along the edges, starting from the root, or equivalently
a prefix of some node label

Ukkonen’s algorithm Ukkonen’s algorithm

1 Start with T = I_1.

2 Consecutively update T to I_2, I_3, ..., I_{n+1} in n phases, where I_i represents the implicit suffix tree of prefix y[1 .. i].

Phase i + 1 updates T from I_i (with all suffixes of y[1 .. i]) to I_{i+1} (with all suffixes of y[1 .. i + 1]).
Each phase i + 1 consists of extensions j = 1, 2, ..., i + 1 (one for each suffix of y[1 .. i + 1]).
Extension j ensures that suffix y[j .. i + 1] is in I_{i+1}.

Ukkonen’s algorithm Suffix extension rules

Rule 1 y[j .. i] ends at a leaf
  Append y[i + 1] to the end of the edge label
Rule 2 y[j .. i] doesn’t end at a leaf, and the following character is not y[i + 1]
  Connect the end of the path to a new leaf j by an edge labeled y[i + 1]. If the path ended at the middle of an edge, split that edge and insert a new node as the parent of leaf j.
Rule 3 The path y[j .. i] is already followed by y[i + 1], i.e. y[j .. i + 1] is already in the tree
  No update.

Ukkonen’s algorithm Complexity

Complexity The so-far presented algorithmic approach runs in O(n³).

Proof Consider a single phase i + 1.
Each extension rule can be applied in O(1) ⇒ Applying all i + 1 extensions takes time Θ(i).
Locating the ends of string paths y[1 .. i], ..., y[i] by traversing the edge labels takes time Σ_{k=1}^{i} k = Θ(i²).
⇒ Therefore, the total time for all phases i = 1, 2, ..., n is Σ_{i=1}^{n} i² = Θ(n³)
Which is even worse than the naive algorithm which runs in O(n²). We will see how this approach, with the use of some simple tricks, can achieve linear run-time.

Ukkonen’s algorithm Suffix links

The extensions of phase i + 1 need to locate the ends of all i + 1 suffixes of y[1 .. i], and apply Rules 1-3.

How to do this efficiently?

For each internal node v of I_i labeled xα, where x ∈ Σ and α ∈ Σ∗, define s(v) to be the node labeled by α. (Do these nodes actually exist?)

Then a pointer from v to s(v) is called the suffix link of v.

Note: If node v is labeled by a single character then α = ε and s(v) is the root node.
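The definition can be explored with a brute-force Python sketch. It relies on a standard fact that is not stated on the slides: in the suffix tree of a text that ends with a unique last letter, the labels of the internal nodes below the root are exactly the non-empty factors that are followed by at least two different letters. The function names are hypothetical and the running time is far from linear; the sketch only illustrates that s(v) exists for every such node.

```python
from collections import defaultdict


def internal_node_labels(text):
    """Labels of the internal nodes below the root (brute force)."""
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])   # text[i:j] is followed by text[j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal node label x + alpha to the label alpha of s(v).

    The empty string stands for the root.
    """
    nodes = internal_node_labels(text)
    links = {}
    for w in nodes:
        target = w[1:]
        # alpha is followed by the same letters as x + alpha, so s(v) exists
        assert target == "" or target in nodes
        links[w] = target
    return links
```

For xabxac this yields the links xa → a and a → root.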

Ukkonen’s algorithm Example of suffix links

Suffix tree for x = xabxac

[Figure: suffix tree of xabxac with the internal nodes xa and a, the edge labels xa, a, bxac and c, and leaves numbered 1 to 6]

Ukkonen’s algorithm Why do we need suffix links?

Extension j (of phase i + 1) finds the end of the path y[j .. i] in the tree (and extends it with character y[i + 1])

Extension j + 1 similarly finds the end of the path y[j + 1 .. i]

Assume that v is an internal node whose string path y[j]α is (essentially) a prefix of y[j .. i]. Then we can avoid traversing path α when locating the end of path y[j + 1 .. i], by starting from node s(v).

Do suffix links always exist?

Ukkonen’s algorithm Suffix links existence

Observation If an internal node v is created during extension j (of phase i + 1), then extension j + 1 will find or create the node s(v).

Let v be labeled xα
Node v can only be created by extension Rule 2. That is, v is inserted at the end of path y[j .. i], which was continued by some character c ≠ y[i + 1].
⇒ Therefore, paths xαc and αc have been entered before phase i + 1.
⇒ In extension j + 1, node s(v) is either found or created at the end of path α = y[j + 1 .. i].

Ukkonen’s algorithm Speeding up path traversals

Consider extensions of phase i + 1

Extension 1 extends path y[1 .. i] with character y[i + 1].

Extension 1 is easy as path y[1 .. i] always ends at leaf 1, and is thus extended by Rule 1.

We can perform extension 1 in constant time, if we maintain a pointer to the edge at the end of y[1 .. i].

What about subsequent extensions j + 1 (for j = 1, 2,..., i)?

Ukkonen’s algorithm Locating subsequent paths

Extension j has located the end of path y[j .. i] and v is the node last visited.

Starting from there, walk up at most one node, either
1 to the root, or
2 to a node v with a suffix link to s(v)

In case of (1), traverse path y[j + 1 .. i] explicitly downwards from the root.

Ukkonen’s algorithm Locating subsequent paths

In case of (2), let xα be the label of v ⇒ y[j .. i]= xαβ for some β ∈ Σ∗

Then follow the suffix link of v, and continue by matching β downwards from node s(v) (whose string path is α).

Having found the end of path αβ = y[j + 1 .. i], apply extension rules to ensure that it extends with y[i + 1].

Finally, if a new internal node w was created in extension j, set its suffix link to point to the end node of path y[j + 1 .. i]

Ukkonen’s algorithm Locating subsequent paths - Illustration

In case of (2), let xα be the label of v ⇒ y[j .. i] = xαβ for some β ∈ Σ∗ (in this case β = abcd)

[Figure: node v (label xα) with a suffix link to s(v) (label α); the edge labels abcd, a, bc and d spell out β below the two nodes]

Ukkonen’s algorithm Speeding up explicit traversals

Skip/Count trick
In phase i + 1, each path y[j .. i], which is followed in extension j, is known to exist in the tree
⇒ The path can be followed by choosing the correct edges, instead of examining every character
Let y[k] be the next character to be matched on path y[j .. i]
Now an edge labeled by y[p .. q] can be traversed simply by checking that y[p] = y[k], and skipping the next q − p characters of y[j .. i]
⇒ The time to traverse a path is proportional to the number of nodes on the path (instead of its string length)
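The trick can be illustrated in isolation with the Python sketch below. It builds a suffix tree naively (suffix trie plus compaction, not Ukkonen’s algorithm) with children keyed by the first letter of their edge, and then follows a path that is assumed to be present by looking at one letter per edge. The names build_suffix_tree and skip_count_walk and the string-labeled edges are choices made for this sketch only.

```python
def build_suffix_tree(text):
    """Naive suffix tree; node = dict: first letter -> (edge label, child)."""
    trie = {}
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    def compact(node):
        out = {}
        for first, child in node.items():
            label = first
            while len(child) == 1:
                (nxt, below), = child.items()
                label, child = label + nxt, below
            out[first] = (label, compact(child))
        return out

    return compact(trie)


def skip_count_walk(tree, path):
    """Follow `path`, which must occur in the tree, using skip/count.

    Returns (edges_used, offset): offset is the number of letters consumed
    on the last edge if the path ends inside it, and 0 if it ends at a node.
    """
    node, k, edges = tree, 0, 0
    while k < len(path):
        label, child = node[path[k]]     # choose the edge by its first letter only
        edges += 1
        if k + len(label) > len(path):   # the path ends inside this edge
            return edges, len(path) - k
        k += len(label)                  # skip the rest of the edge label
        node = child
    return edges, 0


tree = build_suffix_tree("banana$")
```

In this tree the path banan is located with a single letter comparison, because the whole path lies on the edge labeled banana$.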

Ukkonen’s algorithm Speeding up explicit traversals

Lemma For any node v with a suffix link to s(v), it holds that

depth(v) − 1 ≤ depth(s(v)) ≤ depth(v)

Sketch of proof The suffix links for any ancestor of v lead to distinct ancestors of s(v).

Ukkonen’s algorithm Linear bound for any single phase

Theorem Using suffix links and the skip/count trick, a single phase i takes time O(n)

Proof
There are i + 1 ≤ n + 1 extensions in phase i + 1
In any extension, other work except tree-traversal (that is, extension rules) takes O(1) time only
How to bound the work for traversing the tree?
To find the end of the next path, an extension first moves at most one level up. Then a suffix link may be followed, which is followed by a down-traversal to match the rest of the path

Ukkonen’s algorithm Linear bound for any single phase

The up-walk in any extension decreases the current node depth by at most one (since it moves up at most one node) and each suffix link traversal decreases the node-depth by at most another one (previous Lemma).
⇒ Thus the current node depth is decremented at most 2n times during the entire phase.
On the other hand, the current node depth cannot exceed n
⇒ it is incremented (by following downward edges) at most 3n times
⇒ total run-time of a phase is thus O(n)

Improvement Since there are n phases, the total run-time is O(n²)

Ukkonen’s algorithm Final improvements (1)

Some extensions can be found unnecessary to compute explicitly

Observation 1 - Rule 3 terminates current phase
If path y[j .. i + 1] is already in the tree, so are paths y[j + 1 .. i + 1], ..., y[i + 1]

⇒ Phase i + 1 can be finished at the first extension j that applies Rule 3

Ukkonen’s algorithm Final improvements (2)

Observation 2 - Once a leaf, always a leaf
A node created as a leaf remains a leaf thereafter because no extension rule adds children to a leaf. If extension j created a leaf (numbered j), extension j of any later phase i + 1 applies Rule 1 (appending the next character y[i + 1] to the label of the edge ending at leaf j).

Explicit applications of Rule 1 can be eliminated as follows:

Use “compressed” edge representation (i.e. indices p and q instead of substring y[p .. q]), and represent the end position of each terminal edge by a global value e, for the “current end position” (phase).
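One possible way to realise the global value e is a small mutable object that all terminal edges share, so that a single assignment per phase extends every leaf edge at once. The Python sketch below assumes 1-based inclusive indices as on the slides; the class names GlobalEnd and Edge are hypothetical.

```python
class GlobalEnd:
    """Shared 'current end position' e of all terminal (leaf) edges."""

    def __init__(self, value=0):
        self.value = value


class Edge:
    """Compressed edge: indices (p, q) instead of the substring y[p .. q]."""

    def __init__(self, p, q):
        self.p = p
        self.q = q          # an int, or the shared GlobalEnd for leaf edges

    def end(self):
        return self.q.value if isinstance(self.q, GlobalEnd) else self.q

    def label(self, y):
        return y[self.p - 1:self.end()]   # 1-based, inclusive indices


y = "abcabx"
e = GlobalEnd()
leaf_edges = []
for phase in (1, 2, 3):                   # phases 1 to 3 each create one leaf
    e.value = phase                       # implicit Rule 1 for all existing leaves
    leaf_edges.append(Edge(phase, e))
labels_after_phase_3 = [edge.label(y) for edge in leaf_edges]

e.value = 5                               # two more phases, no per-leaf work
labels_after_phase_5 = [edge.label(y) for edge in leaf_edges]
internal_edge = Edge(1, 2)                # a fixed edge such as (1, 2) is unaffected
```

Each edge occupies constant space regardless of the length of its label, which is what keeps the whole tree within O(n) space.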

Ukkonen’s algorithm Eliminating extensions

Denote by j_i the last non-void extension of phase i (that is, application of Rule 1 or 2)

Obs 1 ⇒ extensions 1, ..., j_i of phase i are non-void ⇒ leaves 1, ..., j_i have been created at the end of phase i

Obs 2 ⇒ extensions 1, ..., j_i of any subsequent phase all apply Rule 1

⇒ j_{i+1} ≥ j_i

⇒ Execute only extensions j_i + 1, j_i + 2, ... explicitly in phase i + 1

Ukkonen’s algorithm Single phase algorithm

Algorithm for phase i + 1 with unnecessary extensions eliminated
1 Set e = i + 1 (implements extensions 1, ..., j_i implicitly)
2 Compute extensions j_i + 1, ..., j∗ until j∗ > i + 1 or Rule 3 was applied in extension j∗
3 Set j_{i+1} = j∗ − 1 (for the next phase)

All these tricks together can be shown to lead to linear run-time

Ukkonen’s algorithm Complexity of the tuned implementation (1)

Theorem Ukkonen’s algorithm builds the suffix tree for y[1 .. n] in time O(n), when implemented using the mentioned tricks.

Proof

The extensions computed explicitly in any two phases i and i + 1 are disjoint except for extension j∗, which may be computed anew in phase i + 1.

The second computation of extension j∗ can be done in O(1) by remembering the end of the path entered in the previous computation

Ukkonen’s algorithm Complexity of the tuned implementation (2)

Let j = 1,..., n + 1 denote the index of the current extension

Over all phases 2,..., n + 1 index j never decreases, but it can remain the same at the start of phases 3,..., n + 1 ⇒ at most 2n extensions are computed explicitly.

Similarly to the previous proof (skip/count), the current node depth can be decremented at most 4n times, and thus the total length of all downward traversals is bounded by 5n □

Ukkonen’s algorithm Obtaining the true suffix tree

Finally, the implicit suffix tree I_{n+1} can be converted to the true suffix tree of y[1 .. n]$ in the following way

All occurrences of the “current end position” marker e on edge labels can be replaced by n + 1 (with a simple traversal, in time O(n))

Ukkonen’s algorithm Ukkonen’s algorithm

Reads a string y of size n from left to right.

The algorithm is on-line, i.e. at step 1 ≤ i ≤ n it constructs an implicit suffix tree of prefix y[1 .. i] which can then be easily converted to the (true) suffix tree by appending a unique symbol $ that has not appeared before.

Runs in O(n) time for constant-size alphabets or O(n log n) for general alphabets.

Requires O(n) space.
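For readers who want to experiment, the following is a compact Python sketch of the algorithm. It is not taken from the slides: it uses the common "active point" bookkeeping (active node, active edge, active length and a counter of pending suffixes) instead of explicit phase and extension indices, indices are 0-based, and the caller is assumed to append a unique terminator such as $. The shared list leaf_end plays the role of the global end position e, breaking out of the inner loop corresponds to Rule 3 ending the phase, and all names are hypothetical.

```python
class Node:
    def __init__(self, start, end):
        self.start = start     # 0-based index of the first letter of the incoming edge
        self.end = end         # int for internal edges, shared list for leaf edges
        self.children = {}     # first letter of edge -> child node
        self.link = None       # suffix link (None is treated as "root")


def edge_end(node):
    return node.end[0] if isinstance(node.end, list) else node.end


def build_suffix_tree(text):
    root = Node(-1, -1)
    leaf_end = [-1]                             # global end position e
    node, edge, length, remainder = root, 0, 0, 0
    for i, c in enumerate(text):                # one phase per letter
        leaf_end[0] = i                         # Rule 1 for all existing leaves
        remainder += 1                          # suffixes still to be inserted
        last_new = None                         # internal node awaiting its link
        while remainder > 0:
            if length == 0:
                edge = i
            first = text[edge]
            child = node.children.get(first)
            if child is None:                   # Rule 2: new leaf at a node
                node.children[first] = Node(i, leaf_end)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                span = edge_end(child) - child.start + 1
                if length >= span:              # skip/count: jump over the edge
                    edge, length, node = edge + span, length - span, child
                    continue
                if text[child.start + length] == c:   # Rule 3: phase ends
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                # Rule 2: split the edge and hang a new leaf below the split
                split = Node(child.start, child.start + length - 1)
                node.children[first] = split
                split.children[c] = Node(i, leaf_end)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:              # follow the suffix link
                node = node.link if node.link is not None else root
    return root


def leaf_paths(node, text, prefix=""):
    """Spell out the label of every root-to-leaf path."""
    if not node.children:
        return [prefix]
    paths = []
    for child in node.children.values():
        label = text[child.start:edge_end(child) + 1]
        paths += leaf_paths(child, text, prefix + label)
    return paths
```

A simple sanity check is that leaf_paths(build_suffix_tree(t), t) returns exactly the suffixes of t for a terminated text such as t = "abcabxabc$".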


Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 1 y = abcabxabc $ Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a bcabxabc $ Rule 2

(1, e)

1 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 2 y = a b cabxabc $

(1, e)

1 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b cabxabc $

(1, e)

1 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Explicit y = a b cabxabc $

(1, e)

(2, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 3 y = a b c abxabc $

(1, e)

(2, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c abxabc $

(1, e)

(2, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c abxabc $

(1, e)

(2, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Explicit y = a b c abxabc $

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 4 y = a b c a bx abc $

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c a bx abc $

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c a bx abc $

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c a bx abc $

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a b c a bx abc $ Rule 3

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 5 y = ab c a b x ab c $ ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = ab c a b x ab c $ ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c a b x ab c $ ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Implicit y = a b c a b x ab c $ ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a b c a b x ab c $ Rule 3 ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 Phase 6 y = abcab x a b c $ ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Skip all y = abcab x a b c $ implicit ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a b c a b x a b c $ Rule 2 ↑

(1, e) (3, e)

(2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a b c a b x a b c $ Rule 2 ↑

(1, 2)

(3, e)

(3, e) (2, e) 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = a b c a b x a b c $ Rule 2 ↑

(1, 2)

(3, e)

(6, e) (3, e) (2, e) 4 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = ab c a b x a b c $ Rule 2 ↑

(1, 2)

(3, e)

(6, e) (3, e) (2, e) 4 3

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = ab c a b x a b c $ Rule 2 ↑

(1, 2)

(2, 2) (3, e)

(6, e) (3, e) 4 3

(3, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Explicit - y = ab c a b x a b c $ Rule 2 ↑

(1, 2)

(2, 2) (3, e)

(6, e) (6, e) (3, e) 4 5 3

(3, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Create y = ab c a b x a b c $ suffix link ↑

(1, 2)

(2, 2) (3, e)

(6, e) (6, e) (3, e) 4 5 3

(3, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

↓ 1 2 34 5 6 7 8 910 Create y = ab c a b x a b c $ suffix link ↑

(1, 2)

(2, 2) (3, e)

(6, e) (6, e) (3, e) 4 5 3

(3, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

1 2 34 5 6 7 8 910 y = abcab x a b c $ ↑

(1, 2)

(2, 2) (3, e)

(6, e) (6, e) (3, e) 4 5 3

(3, e)

1 2 Introduction Basic Definitions Dictionaries Suffix tree Example Overview

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Phase 7

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Skip all implicit

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]
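The labels of the form (k, e) in the figures can be read as open-ended leaf edges: every leaf edge ends at a shared position e, so moving e forward by one extends all leaves at once, which is presumably why the implicit extensions can be skipped. The sketch below only illustrates that reading. The names `End` and `label` are hypothetical and not taken from the lecture, and the value e = 6 assumes the state at the end of phase 6.

```python
class End:
    """Shared, mutable end position (the 'e' in labels such as (3, e))."""

    def __init__(self, value):
        self.value = value


def label(text, start, end):
    """Return the substring for a 1-based, inclusive (start, end) edge label."""
    stop = end.value if isinstance(end, End) else end
    return text[start - 1:stop]


y = "abcabxabc$"
e = End(6)                      # assumed state after phase 6: y[1 .. 6] is indexed
leaf_edges = [(3, e), (6, e)]   # two open-ended leaf edges, as in the figures
internal_edge = (1, 2)          # closed edge: its end never moves

before = [label(y, s, t) for s, t in leaf_edges]
e.value = 7                     # next phase: one increment extends every leaf edge
after = [label(y, s, t) for s, t in leaf_edges]
```

No leaf edge is visited individually here: both labels grow by one character because they share the same `End` object, while the closed edge (1, 2) still spells ab.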

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 3

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Phase 8

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Skip all implicit

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 3

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Phase 9

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Skip all implicit

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 3

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Phase 10

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Skip all implicit

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 3

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (6, e), (6, e), (3, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (6, e), (4, e), (3, e); leaves: 6, 4, 5, 3, 1, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (6, e), (4, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Follow suffix link

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (6, e), (4, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (6, e), (4, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Follow suffix link

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, e), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, 3), (3, 3), (6, e), (3, 3), (6, e), (4, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, 3), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 9, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Create suffix link

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, 3), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 9, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

[Figure: suffix tree; edge labels: (1, 2), (6, e), (2, 2), (3, 3), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (4, e), (10, e), (10, e), (3, e); leaves: 6, 4, 5, 9, 3, 1, 7, 2, 8]

Suffix tree - Example

y = abcabxabc$ (positions 1 .. 10)

Explicit - Rule 2

[Figure: suffix tree; edge labels: (1, 2), (6, e), (3, 3), (2, 2), (10, e), (3, 3), (6, e), (3, 3), (6, e), (4, e), (10, e), (4, e), (10, e), (10, e), (3, e); leaves: 10, 6, 4, 5, 9, 3, 1, 7, 2, 8]

Application - finding all occurrences of a query

y = abcabxabc$ (positions 1 .. 10)

Query the string a

1 Find the node to which the string path a leads
2 Get the leaves of that node
3 Leaves indicate the starting positions of a: 1, 4, 7

[Figure: suffix tree of y with string edge labels]

root
  ab
    c
      abxabc$ (leaf 1)
      $ (leaf 7)
    xabc$ (leaf 4)
  b
    c
      abxabc$ (leaf 2)
      $ (leaf 8)
    xabc$ (leaf 5)
  c
    abxabc$ (leaf 3)
    $ (leaf 9)
  xabc$ (leaf 6)
  $ (leaf 10)
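The query procedure can be sketched in a few lines. To keep the example short it uses an uncompressed suffix trie (nested dictionaries) instead of a suffix tree, so it needs quadratic space and does not meet the O(|P| + occ) reporting bound. Only the two steps are the same: follow the path spelled by the pattern, then collect the leaves below it. The names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie as nested dicts; key None stores the leaf label."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1              # 1-based starting position of the suffix
    return root


def occurrences(root, pattern):
    """Return the sorted starting positions of pattern in the indexed text."""
    node = root
    for ch in pattern:                  # step 1: follow the path spelled by pattern
        node = node.get(ch)
        if node is None:
            return []                   # the path does not exist: no occurrence
    found, stack = [], [node]
    while stack:                        # step 2: collect every leaf below the node
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)     # leaf label = starting position
            else:
                stack.append(child)
    return sorted(found)


y = "abcabxabc$"
trie = build_suffix_trie(y)
positions_of_a = occurrences(trie, "a")
```

For the query a this gives positions 1, 4 and 7, matching the leaves below the node reached in the example.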

Contents

1 Introduction 2 Basic Definitions 3 Dictionaries 4 Suffix tree 5 Example 6 Overview

Overview

We had a quick look at indexing.
  Preprocessing a given text
  Efficient querying afterwards

We’ve seen what suffix trees are and some of their properties.
  Patricia tree of the suffixes of a string x[1 .. n]
  At most 2n − 1 nodes
  Exactly n leaves

We’ve seen Ukkonen’s algorithm.
  Fairly simple to understand
  Linear time construction for constant-size alphabets

Reminder - Next week

Next week’s lecture will take place at

SR 148, Building 50.34