An Introduction to Suffix Trees and Indexing

Tomáš Flouri, Solon P. Pissis
Heidelberg Institute for Theoretical Studies
December 3, 2012

Contents

1 Introduction
2 Basic Definitions
   Graph theory
   Alphabet and strings
3 Dictionaries
   Trie
   Patricia tree
4 Suffix tree
   Suffix trie
   Suffix tree
   Ukkonen's algorithm
5 Example
6 Overview

1 Introduction

Two main problem areas in text retrieval:

1 String matching
2 Indexing and querying

Both come in exact and approximate cases.

Exact string matching

Many efficient algorithms exist:

- Knuth-Morris-Pratt algorithm
- Boyer-Moore, Boyer-Moore-Horspool, Turbo-Boyer-Moore, etc.
- Aho-Corasick
- ...

Indexing

Problem: Given a text T, we need to construct an efficient data structure D that serves as an index of T, so that we can query T efficiently.

What do we expect from an efficient indexing data structure?
Indexing (continued)

Given a query pattern P, we want to find all occurrences of P in the preprocessed text T using the indexing data structure D.

The data structure D is efficient if:

- It can be built in time linear in the size of T, that is, O(|T|).
- It occupies space linear in the size of T, that is, O(|T|).
- It can answer whether P occurs in T in time linear in the size of P, that is, O(|P|).
- It can report all occurrences of P in T in time O(|P| + occ), where occ is the number of occurrences.

Some efficient indexing data structures include:

- Suffix automata (DAWG) and variations such as CDAWG
- Suffix trees
- Position heaps
- Suffix arrays

In this lecture we concentrate only on suffix trees.

2 Basic Definitions

Graph theory: Graph, Cycle, Path

Graph: A graph is a pair G = (V, E) of sets such that E ⊆ V × V.

[Figure: example graph on the vertices 1 to 6]

Path: A path of length n in a graph G = (V, E) is a sequence v_0, v_1, ..., v_n ∈ V such that (v_0, v_1), (v_1, v_2), ..., (v_{n-1}, v_n) ∈ E.
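The path definition above can be checked mechanically. The following is a minimal Python sketch, assuming the graph is stored as a set of ordered vertex pairs; the function names `is_path` and `path_length` and the example edge set are hypothetical and are not taken from the slides (the edges of the pictured graph are not recoverable from the text).

```python
from typing import Hashable, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def is_path(edges: Set[Edge], seq: Sequence[Hashable]) -> bool:
    """Return True if every pair of consecutive vertices in seq is an edge.

    This mirrors the definition: (v_0, v_1), ..., (v_{n-1}, v_n) must all
    belong to E. A single vertex is a path of length 0.
    """
    return all((u, v) in edges for u, v in zip(seq, seq[1:]))


def path_length(seq: Sequence[Hashable]) -> int:
    """Length of a path v_0, ..., v_n is n, the number of edges used."""
    return len(seq) - 1


# Hypothetical example: the vertex set matches the figure (1 to 6),
# but the edge set below is invented purely for illustration.
V = {1, 2, 3, 4, 5, 6}
E = {(1, 2), (2, 3), (3, 4), (4, 1), (4, 5)}

print(is_path(E, [1, 2, 3, 4]))   # consecutive pairs are all in E
print(is_path(E, [1, 3]))         # (1, 3) is not in E
print(path_length([1, 2, 3, 4]))  # three edges
```

Storing E as a set of pairs keeps each membership test at expected constant time, so checking a candidate path of length n costs O(n).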
Cycle: A path v_0, v_1, ..., v_n, v_0, where n ≥ 2, is called a cycle.

Graph theory: Rooted tree, subtree, tree height, node height

Tree: A rooted tree is a connected acyclic graph T = (V, E) with a special vertex v ∈ V called the root. Nodes with degree 1 are called leaves.

Alphabet and strings

Definition (Alphabet): An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String): A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε.
The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string): The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x. We also call index i, for all 1 ≤ i ≤ |x|, a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i in x, and that x = x[1 .. |x|].

Definition (Factor of string): A string x is a factor (substring) of a string y if there exist two strings u and v such that y = uxv. We denote the factor (substring) of x starting at position i and ending at position j as x[i .. j].
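The slides index strings from 1, whereas most programming languages index from 0. The sketch below shows one way to map the notation x[i], x[i .. j] and the factor definition onto Python strings. The helper names `letter`, `factor` and `is_factor` are hypothetical, and the restriction 1 ≤ i ≤ j ≤ |x| in `factor` is an assumption made here for simplicity.

```python
def letter(x: str, i: int) -> str:
    """x[i] in the 1-based notation of the slides."""
    if not 1 <= i <= len(x):
        raise IndexError("position must satisfy 1 <= i <= |x|")
    return x[i - 1]


def factor(x: str, i: int, j: int) -> str:
    """x[i .. j]: the factor from position i to position j, both inclusive."""
    if not 1 <= i <= j <= len(x):
        raise IndexError("positions must satisfy 1 <= i <= j <= |x|")
    return x[i - 1:j]


def is_factor(x: str, y: str) -> bool:
    """True if y = u x v for some (possibly empty) strings u and v.

    Tries every start position k in y; this is the naive check and takes
    O(|x| * |y|) time in the worst case.
    """
    return any(y[k:k + len(x)] == x for k in range(len(y) - len(x) + 1))


x = "banana"
print(letter(x, 1))               # first letter of x
print(factor(x, 2, 4))            # letters at positions 2, 3 and 4
print(factor(x, 1, len(x)) == x)  # x = x[1 .. |x|]
print(is_factor("nan", x))        # "banana" = "ba" + "nan" + "a"
```

The empty string ε is a factor of every string under this definition (take u = ε and v = y), which `is_factor` also reports.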
3 Dictionaries

Trie

The name comes from "retrieval".

Construct a dictionary for the set of words {amy, andy, ann, rob, roger, ben, betty}.

[Figure: trie for the word set, one letter per edge, with nodes labelled A to T. A second version of the figure adds a $ label at the end of each word.]

Patricia tree

1 Construct a trie
2 Remove nodes with out-degree 1 and concatenate the labels of the corresponding edges into one edge

[Figure: the trie before and after compaction. The resulting Patricia tree has the edge labels a, my, n, dy, n, ro, b, ger, be, n, tty.]

4 Suffix tree

Suffix trie

Given some text, i.e.
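Returning to the dictionary structures of Section 3, the two-step Patricia construction (build a trie, then merge out-degree-1 nodes into their incoming edge) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the class `Node`, the functions `build_trie`, `compact`, `contains` and `edge_labels`, and the `is_word` flag (used here in place of the $ marker) are all assumptions introduced for the example.

```python
from typing import Dict, Iterable, List


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}  # edge label -> child node
        self.is_word = False                   # stands in for the $ marker


def build_trie(words: Iterable[str]) -> Node:
    """Step 1: insert every word letter by letter, one letter per edge."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root


def compact(node: Node) -> Node:
    """Step 2: remove out-degree-1 nodes, concatenating their edge labels.

    A node that ends a word is kept even if it has a single child, so that
    no dictionary entry is lost (an assumption of this sketch).
    """
    merged: Dict[str, Node] = {}
    for label, child in node.children.items():
        while len(child.children) == 1 and not child.is_word:
            (next_label, next_child), = child.children.items()
            label += next_label
            child = next_child
        merged[label] = compact(child)
    node.children = merged
    return node


def contains(root: Node, word: str) -> bool:
    """Dictionary lookup; works on both the trie and the compacted tree."""
    node, rest = root, word
    while rest:
        for label, child in node.children.items():
            if rest.startswith(label):
                node, rest = child, rest[len(label):]
                break
        else:
            return False  # no outgoing edge matches the remaining letters
    return node.is_word


def edge_labels(node: Node) -> List[str]:
    """Edge labels in preorder, visiting children in sorted label order."""
    labels: List[str] = []
    for label in sorted(node.children):
        labels.append(label)
        labels.extend(edge_labels(node.children[label]))
    return labels


words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
patricia = compact(build_trie(words))
print(edge_labels(patricia))
print(contains(patricia, "roger"), contains(patricia, "rog"))
```

For the word set from the slides, the compacted tree carries the same eleven edge labels as the Patricia tree figure. Because the outgoing labels of a node start with distinct letters, at most one edge can match during a lookup.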
