Suffix and Factor Automata and Combinatorics on Words

Gabriele Fici
Workshop PRIN 2007–2009
Varese, 5 September 2011

The Suffix Automaton

Definition (A. Blumer et al. 85; M. Crochemore 86)
The Suffix Automaton (SA) of the word w is the minimal deterministic automaton recognizing the suffixes of w.

Example
The SA of w = aabbabb.
[Figure: the SA of w = aabbabb, with states 0 to 7 along the path spelling w and three further states 3′, 3′′ and 4′′.]

Algorithmic Construction

The SA allows the search of a pattern v in a text w in time and space $O(|v|)$. Moreover:

Theorem (A. Blumer et al. 85; M. Crochemore 86)
The SA of a word w over a fixed alphabet Σ can be built in time and space $O(|w|)$.

The SA has several applications, for example in:
- pattern matching
- music retrieval
- spam detection
- search of characteristic expressions in literary works
- speech recordings alignment
- ...
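The linear-time construction and the $O(|v|)$ search can be illustrated with a short sketch. It follows a common textbook formulation of the online construction (one letter at a time, with suffix links and state cloning), which may differ in detail from the algorithms in the cited papers. The class name `SuffixAutomaton` and its methods are hypothetical names chosen for this illustration.

```python
class SuffixAutomaton:
    """Suffix automaton of a word, built online one letter at a time."""

    def __init__(self, word):
        # Parallel lists indexed by state; state 0 is the initial state.
        self.next = [{}]     # outgoing transitions: letter -> state
        self.link = [-1]     # suffix links (-1 means "no link")
        self.length = [0]    # length of the longest factor reaching the state
        self.last = 0        # state reached by the whole word read so far
        for letter in word:
            self._extend(letter)

    def _new_state(self, length, transitions, link):
        self.next.append(dict(transitions))
        self.link.append(link)
        self.length.append(length)
        return len(self.next) - 1

    def _extend(self, c):
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        # Add the missing c-transitions along the suffix links.
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter factors of its class.
                clone = self._new_state(self.length[p] + 1,
                                        self.next[q], self.link[q])
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def _walk(self, v):
        """Return the state reached by reading v, or None if v is not a factor."""
        state = 0
        for c in v:
            state = self.next[state].get(c)
            if state is None:
                return None
        return state

    def is_factor(self, v):
        return self._walk(v) is not None

    def is_suffix(self, v):
        # Final states are those on the suffix-link path from the last state.
        finals, p = set(), self.last
        while p != -1:
            finals.add(p)
            p = self.link[p]
        return self._walk(v) in finals


sa = SuffixAutomaton("aabbabb")
print(len(sa.next))          # number of states
print(sa.is_factor("bba"))   # one transition per letter of the pattern
print(sa.is_suffix("abb"))
```

For w = aabbabb this sketch produces 11 states, the same count as in the example figure. A pattern query follows one transition per letter of v, which is where the $O(|v|)$ search time comes from.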
One Way to Build the SA

Build a naive non-deterministic automaton for w = aabbabb:
[Figure: a path with states 0 to 7 whose consecutive transitions spell a a b b a b b.]

Determinize by subset construction:
[Figure: the resulting deterministic automaton, with states {0, 1, 2, ..., 7}, {1, 2, 5}, {2}, {3}, {4}, {5}, {6}, {7}, {3, 6}, {4, 7} and {3, 4, 6, 7}.]

Ending Positions

We associate to each factor v of w the set of ending positions of v in w. We denote this set by $\mathrm{Endset}_w(v)$.

Example
w = a a b b a b b, with positions 1 2 3 4 5 6 7.
$\mathrm{Endset}_w(ba) = \{5\}$, $\mathrm{Endset}_w(abb) = \mathrm{Endset}_w(bb) = \{4, 7\}$.

Define on Fact(w) the equivalence:
$u \sim_{SA} v \iff \mathrm{Endset}_w(u) = \mathrm{Endset}_w(v)$

Remark
$u \sim_{SA} v$ if and only if for any $z \in \Sigma^*$ one has $uz \in \mathrm{Suff}(w) \iff vz \in \mathrm{Suff}(w)$.

Remark
$\mathrm{Fact}(w)/{\sim_{SA}}$ is in bijection with the set of states of the SA of w.
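The subset construction and the ending-position sets can be compared directly on the example. The sketch below assumes that the naive automaton has states 0 to |w|, a transition from i to i+1 labelled by the (i+1)-th letter of w, every state initial and |w| as the only final state. The function names `endset` and `determinize` are hypothetical.

```python
def endset(w, v):
    """Ending positions (1-based) of the occurrences of v in w."""
    return frozenset(i + len(v)
                     for i in range(len(w) - len(v) + 1)
                     if w[i:i + len(v)] == v)


def determinize(w):
    """Subset construction on the naive automaton of w.

    Returns the initial subset, the set of reachable subsets and the
    transition table mapping (subset, letter) to a subset.
    """
    start = frozenset(range(len(w) + 1))   # every state is initial
    states = {start}
    delta = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        for c in sorted(set(w)):
            target = frozenset(i + 1 for i in subset
                               if i < len(w) and w[i] == c)
            if not target:
                continue                   # no c-transition from this subset
            delta[subset, c] = target
            if target not in states:
                states.add(target)
                todo.append(target)
    return start, states, delta


w = "aabbabb"
start, states, delta = determinize(w)
for subset in sorted(states, key=lambda s: (-len(s), sorted(s))):
    print(sorted(subset))

# The reachable subsets coincide with the Endsets of the factors of w.
all_factors = {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
print(states == {endset(w, v) for v in all_factors})
```

Each reachable subset is the Endset of the factors that lead to it, which is one way to see the bijection between the classes of $\sim_{SA}$ and the states of the SA.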
The Number of States of the SA

The number of states (classes) of the SA of w is therefore
$|SA(w)| = |\mathrm{Fact}(w)/{\sim_{SA}}|$

The bounds are well known:
$|w| + 1 \le |SA(w)| \le 2|w| - 1$

The upper bound is reached for $w = ab^{|w|-1}$, with $a \ne b$.

And for the lower bound?

Problem (J. Berstel and M. Crochemore)
Characterize the language $L_{SA}$ of words such that $|SA(w)| = |w| + 1$.
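The bounds can be checked by brute force on short words. The sketch below counts the states of the SA as the number of distinct Endsets, which is far slower than a real construction but easy to verify by hand. The names `sa_size` and `bounds_hold` are hypothetical, and the check starts at length 2 because a one-letter word already has two states.

```python
from itertools import product


def sa_size(w):
    """Number of states of SA(w), counted as distinct Endset classes."""
    n = len(w)
    classes = set()
    for length in range(n + 1):
        for start in range(n - length + 1):
            v = w[start:start + length]
            classes.add(frozenset(k + length
                                  for k in range(n - length + 1)
                                  if w[k:k + length] == v))
    return len(classes)


def bounds_hold(max_len, alphabet="ab"):
    """Check |w| + 1 <= |SA(w)| <= 2|w| - 1 for all words of length 2..max_len."""
    return all(len(w) + 1 <= sa_size(w) <= 2 * len(w) - 1
               for n in range(2, max_len + 1)
               for w in map("".join, product(alphabet, repeat=n)))


print(bounds_hold(7))
for n in range(2, 8):
    # a^n sits on the lower bound, a b^(n-1) on the upper bound
    print(n, sa_size("a" * n), sa_size("a" + "b" * (n - 1)))
```

The words $a^n$ are only the simplest members of $L_{SA}$; the problem above asks for a description of all of them.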
Special Factors

Definition
- v is a left special factor of w if there exist $a \ne b$ such that av and bv are factors of w
- v is a right special factor of w if there exist $a \ne b$ such that va and vb are factors of w
- v is a bispecial factor of w if it is both left and right special

Example
w = aabbabb
- ab is left special
- b is right special
- a and b are bispecial

The Number of States of the SA

Theorem (G. Fici 09)
$|SA(w)| = |w| + 1 + S^L_w - P_w$
where
- $S^L_w$ = number of left special factors of w
- $P_w$ = length of the shortest prefix of w which is not left special

Example (w = aabbabb)
[Figure: the SA of w = aabbabb, with states 0 to 7 along the path spelling w and three further states 3′, 3′′ and 4′′.]
- $S^L_w = 5$ since the left special factors of w are ε, a, b, ab, abb
- $P_w = 2$ since a is left special in w and aa is not
- $|SA(w)| = |w| + 1 + S^L_w - P_w = 7 + 1 + 5 - 2 = 11$

If $|\Sigma| = 2$, $L_{SA}$ is the set of finite prefixes of standard Sturmian words, i.e., the set of left special factors of Sturmian words.
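The formula of the theorem can be evaluated mechanically and compared with a direct count of the Endset classes. The sketch below is a brute-force illustration under that reading, with hypothetical function names. It also tests the prefixes of the Fibonacci word abaababaabaab..., used here as an example of a standard Sturmian word.

```python
def factors(w):
    """All factors of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def left_special_factors(w):
    """Factors v such that av and bv are factors of w for two letters a != b."""
    facts = factors(w)
    return {v for v in facts
            if sum(1 for a in set(w) if a + v in facts) >= 2}


def shortest_non_left_special_prefix(w):
    """P_w: length of the shortest prefix of w which is not left special."""
    special = left_special_factors(w)
    # w itself is never left special in w, so the search always succeeds.
    return next(k for k in range(len(w) + 1) if w[:k] not in special)


def sa_size_by_formula(w):
    return (len(w) + 1 + len(left_special_factors(w))
            - shortest_non_left_special_prefix(w))


def sa_size_by_endsets(w):
    """Direct count of the classes of factors with equal ending positions."""
    return len({frozenset(i + len(v)
                          for i in range(len(w) - len(v) + 1)
                          if w[i:i + len(v)] == v)
                for v in factors(w)})


w = "aabbabb"
print(sorted(left_special_factors(w), key=lambda v: (len(v), v)))
print(shortest_non_left_special_prefix(w))
print(sa_size_by_formula(w), sa_size_by_endsets(w))

# Prefixes of the Fibonacci word: each should have exactly |w| + 1 states.
prev, cur = "a", "ab"
while len(cur) < 13:
    prev, cur = cur, cur + prev
print(all(sa_size_by_endsets(cur[:k]) == k + 1 for k in range(1, len(cur) + 1)))
```

The formula reflects the fact that the longest factor of each class is either a prefix of w or a left special factor, and that exactly $P_w$ prefixes of w are left special, so they are counted once instead of twice.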
