Text Searching

Thierry Lecroq [email protected]

Laboratoire d’Informatique, du Traitement de l’Information et des Syst`emes.

International PhD School in Formal Languages and Applications Tarragona, November 20th and 21st, 2006 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan

1 Introduction

2 Left-to-right scan — shift

3 Right-to-left scan — shift

Thierry Lecroq Text Searching 2/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan

1 Introduction

2 Left-to-right scan — shift

3 Right-to-left scan — shift

Thierry Lecroq Text Searching 3/77 Introduction Left-to-right scan — shift Right-to-left scan — shift

text pattern  - position of occurrence Problem Find all the occurrences of pattern x of length m inside the text y of length n

Two types of solutions Fixed pattern preprocessing time O(m) Use of combinatorial properties searching time O(n) Static text preprocessing time O(n) Solutions based on indexes searching time O(m)

Thierry Lecroq Text Searching 4/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching for a fixed string

String matching given a pattern x, find all its locations in any text y

Pattern a string x of symbols, of length m t a t a

Text a string y of symbols, of length n c acgtatatatgcgttataat

Thierry Lecroq Text Searching 5/77 Introduction Left-to-right scan — shift Right-to-left scan — shift String matching

Occurrences c acgtatatatgcgttataat

t a t a t a t a

t a t a

Basic operation symbol comparison (=, 6=)

Thierry Lecroq Text Searching 6/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interest

Practical basic problem for search for various patterns lexical analysis approximate string matching comparisons of strings—alignments ...

Theoretical design of algorithms analysis of algorithms—complexity combinatorics on strings ...

Thierry Lecroq Text Searching 7/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sliding window strategy window text y shift - § ¤ scan ...... scan ...... scan pattern x shift -

Scan-and-shift mechanism put window at the beginning of text while window on text do scan: if window = pattern then report it shift: shift window to the right and memorize some information for use during next scans and shifts

Thierry Lecroq Text Searching 8/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive search

Principles No memorization, shift by 1 position

Complexity O(m × n) time, O(1) extra space

Number of symbol comparisons maximum ≈ m × n expected ≈ 2 × n on a two-letter alphabet, with equiprobablity and independence conditions

Thierry Lecroq Text Searching 9/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive string-searching algorithm text y u b 0 pos j n − 1 pattern x u a 0 i m − 1

Algorithm NaiveSearch(string x, y; integer m, n) pos ←− 0 while pos 6 n − m do i ←− 0 while i < m and x[i] = y[pos + i] do i ←− i + 1 if i = m then output(’x occurs in y at position ’, pos) pos ←− pos + 1

Thierry Lecroq Text Searching 10/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of the problem pattern x of length m text y of length n (n > m)

Theorem The search can be done optimally in time O(n) and space O(1). [Galil and Seiferas, 1983]

Theorem log m The search can be done in optimal expected time O( m × n). [Yao, 1979]

Theorem The maximal number of comparisons done during the search is 9 8 > n + 4m (n − m), and can be made 6 n + 3(m+1) (n − m). [Cole et alii, 1995]

Thierry Lecroq Text Searching 11/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Known bounds on symbol comparisons

Lower bounds Upper bounds access to the whole text n [Folklore] 2n − 1 [Morris and Pratt, 1970] 4 3 n [Zwick and Paterson, 1991] search with a sliding window of size m 3 2 n [Apost. and Croch., 1991] 4 4 3 n 3 n [Galil and Giancarlo, 1991] 4 1 3 n − 3 m [Galil and Giancarlo, 1992] 4 log m+2 n + m (n − m) [Breslauer and Galil, 1993] 9 8 n + 4m (n − m) n + 3(m+1) (n − m) [Cole et alii, 1995]

Thierry Lecroq Text Searching 12/77 Introduction Left-to-right scan — shift Right-to-left scan — shift

Known bounds on symbol comparisons (followed)

Lower bounds Upper bounds search with a sliding window of size 1 2n − 1 [Morris and Pratt, 1970] 1 1 (2 − m )n (2 − m )n [Hancart, 1993] [Breslauer et alii, 1993] delay = maximum number of comp’s on each text symbol m [Morris and Pratt, 1970]

logΦ(m + 1) [Knuth, MP, 1977] min(logΦ(m + 1), c) [Simon, 1989] min(1 + log2 m, c) min(1 + log2 m, c) [Hancart, 1993] log min(1 + log2 m, c) [Hancart, 1996]

Thierry Lecroq Text Searching 13/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods

prefix of text in Σ∗pattern

§ pattern ¤

sequential searches (window size = one symbol) ,→ adapted to telecommunications ,→ based on efficient implementations of automata [Knuth, Morris, Pratt, 1976], [Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993]

Thierry Lecroq Text Searching 14/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods (followed) time-space optimal searches ,→ mainly of theoretical interest ,→ based on combinatorial properties of strings [Galil, Seiferas, 1983], [Crochemore, Perrin, 1991], [Crochemore, 1992], [G¸asieniec, Plandowski, Rytter, 1995], [Crochemore, G¸asieniec, Rytter, 1999] practically-fast searches ,→ used in text editors, data retrieval software ,→ based on combinatorics + automata (+ heuristics) [Boyer, Moore, 1977], [Galil, 1979], [Apostolico, Giancarlo, 1986], [Crochemore et alii, 1994]

Thierry Lecroq Text Searching 15/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan

1 Introduction

2 Left-to-right scan — shift

3 Right-to-left scan — shift

Thierry Lecroq Text Searching 16/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Left-to-right scan — shift text y u b

pattern x u a  - period(u) border(u) ¦ ¥ Mismatch situation: a 6= b period(u) = |u| − |border(u)| Optimal shift length = period(ub) Valid if u = x

Thierry Lecroq Text Searching 17/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders

Non-empty string u, integer p, 0 < p 6 |u| p is a period of u if any of these equivalent conditions is satisfied: u[i] = u[i + p], for 1 6 i 6 |u| − p u is a prefix of some yk, k > 0, |y| = p u = yw = wz, for some strings y, z, w with |y| = |z| = p String w is called a border of u The period of u, period(u), is its smallest period (can be |u|) The border of u, border(u), is its longest border (can be empty)

Thierry Lecroq Text Searching 18/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders (followed)

Example Periods and borders of abacabacaba 4 abacaba 8 aba 10 a 11 empty string

Thierry Lecroq Text Searching 19/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sequential search

Simple online search Length of shift = period Memorization of borders

Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window period(u) places to right memorize border(u)

Thierry Lecroq Text Searching 20/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm

text y u b 0 pos j n − 1 pattern x u a 0 i m − 1

0 MPnext[i] m − 1

Thierry Lecroq Text Searching 21/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm (followed)

Algorithm MP(string x, y; integer m, n) i ←− 0; j ←− 0 while j < n do while (i = m) or (i > 0 and x[i] 6= y[j]) do i ←− MPnext[i] i ←− i + 1; j ←− j + 1 if i = m then output(’x occurs in y at position’, j − i)

Thierry Lecroq Text Searching 22/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing borders of prefixes

A border of a border of u is a border of u A border of u is either border(u) or a border of it Border[i] = |border(x[0 . . i − 1])| j runs through decreasing lengths of borders

Algorithm ComputeBorders(string x; integer m) Border[0] ←− −1 for i ←− 0 to m − 1 do j ←− Border[i] while j > 0 and x[i] 6= x[j] do j ←− Border[j] Border[i + 1] ←− j + 1 return Border

MPnext[i] = Border[i] for i = 0, . . . , m

Thierry Lecroq Text Searching 23/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement

x u σ x w τ w  - p

Interrupted periods — strict borders Changes only the preprocessing of MP algorithm

Thierry Lecroq Text Searching 24/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement (followed)

Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window interruptPperiod(u) places to the right memorize strictBorder(u)

Thierry Lecroq Text Searching 25/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interrupted periods and strict borders

Fixed string x, non-empty prefix u of x w is a strict border of u if both: w is a border of u wτ is a prefix of x, but uτ is not p is an interrupted period of u if p = |u| − |w| for some strict border |w| of u

Example Prefix abacabacaba of abacabacabacc

Interrupted periods and strict borders of abacabacaba 10 a 11 empty string

Thierry Lecroq Text Searching 26/77 Introduction Left-to-right scan — shift Right-to-left scan — shift KMP preprocessing

k = MPnext[i]  k, if x[i] 6= x[k] or if i = m, KMPnext[i] = KMPnext[k], if x[i] = x[k].

Algorithm ComputeKMPnext(string x; integer m); KMPnext[0] ←− −1; j ←− 0 for i ←− 1 to m − 1 do if x[i] = x[j] then KMPnext[i] ←− KMPnext[j] else KMPnext[i] ←− j do j ←− KMPnext[j] while j > 0 and x[i] 6= x[j] j ←− j + 1 KMPnext[m] ←− j return KMPnext

Thierry Lecroq Text Searching 27/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Running times of MP and KMP

Theorem On a text of length n, MP and KMP string-searching algorithm run in time O(n). They make less than 2n symbol comparisons.

Proof. Positive comparisons increase the value of j Negative comparisons increase the value of j − i (shift)

Thierry Lecroq Text Searching 28/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Running times of MP and KMP (followed)

Delay = maximum number of comparisons on a text symbol

Theorem Pattern of length m. The delay for MP algorithm is no more than m. The delay for KMP algorithm√ is no more than logΦ(m + 1), where Φ is the golden ratio, (1 + 5)/2.

Proof. For KMP, use the periodicity theorem

A worst-case pattern of length 19: abaababaabaababaaba

Thierry Lecroq Text Searching 29/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periodicities

Theorem (Fine & Wilf 1965) If p and q are periods of a word x and satisfy p + q − GCD(p, q) 6 |x| then GCD(p, q) is a period of x.

Used in the analysis of KMP algorithm and in the analysis of many other pattern matching algorithms.

Theorem (Weak version) If p and q are periods of a word x and satisfy p + q 6 |x| then GCD(p, q) is a period of x.

Proof. If p and q are periods of x, p > q, then p − q is also a period of x.

Thierry Lecroq Text Searching 30/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Proof of the weak statement

p and q periods of x with p + q 6 |x| and p > q p − q period of x because:

p -

a b c  - p − q q p -

a bc   - q p − q rest of the proof like Euclid’s induction

Thierry Lecroq Text Searching 31/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching with an automaton

Uses the string-matching automaton SMA(x): smallest deterministic automaton accepting Σ∗x

Example x = abaa

Search for abaa in: babbaabaabaabba ··· state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 ···

Thierry Lecroq Text Searching 32/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching algorithm

Simple parsing of the text with the string-matching automaton SMA(x)

Algorithm SEARCH(string x, y; integer m, n) (Q, Σ, initial, {terminal }, δ) is the automaton SMA(x) q ←− initial state if q = terminal then report an occurrence of x in y while not end of y do σ ←− next symbol of y q ←− δ(q, σ) if q = terminal then report an occurrence of x in y

Thierry Lecroq Text Searching 33/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Construction of SMA(x)

Unwinding arcs From SMA(abaa)

to SMA(abaab)

Thierry Lecroq Text Searching 34/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Construction algorithm

Algorithm automaton SMA(string x) let initial be a new state Q ←− {initial } terminal ←− initial for all σ in Σ do δ(initial, σ) ←− initial while not end of x do τ ←− next symbol of x r ←− δ(terminal, τ) add new state s to Q δ(terminal, τ) ←− s for all σ in Σ do δ(s, σ) ←− δ(r, σ) terminal ←− s return (Q, Σ, initial, {terminal }, δ)

Thierry Lecroq Text Searching 35/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Significant arcs

Example Complete SMA(ananas)

a n, s a a 'a $ § §   - -a§ - n- a¤ - n- a- s - 0 1 2 3 4 5 6  s n n, s        ¦ ¥ s ¦ ¥  n, s & n,% s & % & %

Thierry Lecroq Text Searching 36/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Significant arcs (followed)

Forward arcs: spell the pattern Backward arcs: arcs going backwards without reaching the initial state

a a a 'a $ §   - -a§ - n- a¤ - n- a- s - 0 1 2 3 4 5 6  n        Lemma ¦ ¥ SMA(x) contains at most |x| backward arcs.

Consequence The implementation of SMA(x) can be done in O(|x|) time and space, independently of the alphabet size

Thierry Lecroq Text Searching 37/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Backward arcs in SMA

States of SMA(x) identified with prefixes of x A backward arc is of the form (u, τ, vτ) (u, v prefixes of x, τ symbol) with vτ is the longest suffix of uτ that is a prefix of x, and u 6= v Note: uτ is not a prefix of x Let p(u, τ) = |u| − |v| (a period of u)

Backward arcs to periods Two backward arcs (u, τ, vτ), (u0, τ 0, v0τ 0) If p(u, τ) = p(u0, τ 0) = p, then vτ = v0τ 0. Otherwise, if for instance vτ is a proper prefix of v0τ 0, vτ occurs at position p like v0 does, uτ is a prefix of x, a contradiction Thus v = v0, τ = τ 0, and then u = u0

Thierry Lecroq Text Searching 38/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Backward arcs in SMA (followed)

Each period p, 1 6 p 6 |x|, corresponds to at most one backward arc, thus there are at most |x| such arcs

A worst case SMA(abm−1) has m backward arcs (a 6= b)

Thierry Lecroq Text Searching 39/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of searching with SMA

Pattern x of length m, text y of length n

With complete SMA implemented by transition matrix Preprocessing on pattern x time O(m × card Σ) space O(m × card Σ) Search on text y time O(n) space O(m × card Σ) Delay time constant

With SMA implemented by lists of backward arcs Preprocessing on pattern x time O(m) space O(m) Search on text y time O(n) space O(m) Delay comparisons min(card Σ, log2 m)

Improves on KMP algorithm

Thierry Lecroq Text Searching 40/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan

1 Introduction

2 Left-to-right scan — shift

3 Right-to-left scan — shift

Thierry Lecroq Text Searching 41/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Right-to-left scan — shift

Example text ··· c g c t c g c g c t a t c g ··· pattern c g c t a g c c g c t a g c c g c t a g c

Match shift: good-suffix rule (function d) Occurrence heuristics: bad-character rule Extra rules if memorization

Thierry Lecroq Text Searching 42/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Right-to-left scan — shift (followed)

Algorithm while window on text do u ←− longest common suffix of window and pattern if u = pattern then report a match shift window d(u) places to the right

Thierry Lecroq Text Searching 43/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Match shift text y τ u pos pattern x σ u  shift - i u text y ? τ u pos pattern x σ u  shift - i

Precomputation of rightmost occurrences of u’s: O(m) Second shift length = a period of x Table D implements the good-suffix rule: shift = d(u) = D[i]

Thierry Lecroq Text Searching 44/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM algorithm

text y τ u 0 pos j n − 1 pattern x σ u 0 i m − 1

No memorization of previous matches Table D implements the good-suffix rule

Thierry Lecroq Text Searching 45/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM algorithm (followed)

Algorithm BM(string x, y; integer m, n) pos ←− 0 while pos 6 n − m do i ←− m − 1 while i > 0 and x[i] = y[pos + i] do i ←− i − 1 if i = −1 then output(’x occurs in y at position ’, pos) pos ←− pos + D[i + 1]

Thierry Lecroq Text Searching 46/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffix displacement

Displacement function d: d(u) = min{|z| > 0 | (x suffix of uz) or (τuz suffix of x and τu not suffix of x, for τ ∈ Σ)} Displacement table D: D[i] = d(x[i . . m − 1]), for i = 0, .., m − 1 D[m] = 1

τ u z00 x σ u

u z0

Thierry Lecroq Text Searching 47/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffix displacement (followed)

Note 1: u is a (strict) border of uz00 Note 2: |z| is a period of uz (thus, |z0| is a period of x)

Lemma Table D can be computed in linear time.

Thierry Lecroq Text Searching 48/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffixes

Elements for computing the table Suf i given: Suf [j] known for i < j 6 m g = min{f − Suf [f] | i < f. 6 m}

τ 0 u0 τ 0 u0 σ0 u0 τ u σ u pattern x

0 g i f i + m − f − 1 m − 1

If i − Suf [i + m − f − 1] 6= k: Suf [i] = min{Suf [i + m − f − 1], i − g}

Thierry Lecroq Text Searching 49/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffixes

τ 0 u0 τ 0 u0 σ0 u0 τ u σ u pattern x

0  g i f i + m− f − 1 m − 1 scan scan If i − Suf [i + m − f − 1] = g: scan to find Suf [i] Yields new g and new f

Thierry Lecroq Text Searching 50/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf

Algorithm ComputeSUF(string x; integer m) Suf [m − 1] ←− m; g ←− m − 1 for i ←− m − 2 downto 0 do if i > g and Suf [i + m − 1 − f] 6= i − g then Suf [i] ←− min{Suf [i + m − f − 1], i − g} else f ←− i; g ←− min(g, i) while g > 0 and x[g] = x[g + m − 1 − f] do g ←− g − 1 Suf [i] ←− f − g return Suf

Thierry Lecroq Text Searching 51/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf

i x a z m − j − 1  -

x c z j  - Suf[j]

D[i] = D[m − 1 − Suf [j]] = m − j − 1

Thierry Lecroq Text Searching 52/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf

i x z m − j − 1  -

x j

D[i] = m − j − 1

Thierry Lecroq Text Searching 53/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM preprocessing

Algorithm ComputeD(string x; integer m; table Suf ) i ←− 0 for j ←− m − 2 downto −1 do if j = −1 or Suf [j] = j + 1 then while i > m − 1 − j do D[i] ←− m − 1 − j i ←− i + 1 for j ←− 0 to m − 2 do D[m − 1 − Suf [j]] ←− m − 1 − j return D

Thierry Lecroq Text Searching 54/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Occurrence shift

text y τ u pos pattern x σ u  shift - τ  - DA[τ]

Table DA implements the bad-character rule: DA[σ] = min({|z| > 0 | σz suffix of x} ∪ {|x|}) shift = DA[τ] − |u| = DA[τ] − m + i

Thierry Lecroq Text Searching 55/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Occurrence shift (followed)

Algorithm ComputeDA(string x; integer m) for all σ in Σ do DA[σ] = m for i ←− 0 to m − 2 do DA[x[i]] = m − i − 1 return DA

Thierry Lecroq Text Searching 56/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM with occurrence shift

text y τ u 0 pos j n − 1 pattern x σ u 0 i m − 1

Use of DA in addition to D

Thierry Lecroq Text Searching 57/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM with occurrence shift (followed)

Algorithm BM(string x, y; integer m, n); pos ←− 0 while pos 6 n − m do i ←− m − 1 while i > 0 and x[i] = y[pos + i] do i ←− i − 1 if i = −1 then output(’x occurs in y at position ’, pos) pos ←− pos + max(D[i + 1], DA[y[pos + i]] − m + i + 1)

Thierry Lecroq Text Searching 58/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of BM

Preprocessing phase match shift O(m) occurrence shift O(m + card Σ)

Search phase (finding all occurrences) running time O(n × m) minimum number of comparisons n/m maximum number of comparisons n × m

Extra space for shift functions O(m + card Σ) can be reduced to O(m)

Thierry Lecroq Text Searching 59/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Symbol comparisons of BM

For finding the first occurrence [Knuth, Morris, Pratt, 1977] 6 7 × n [Guibas, Odlysko, 1980] 6 4 × n [Cole, 1994] 6 3 × n

Theorem If period(x) > m/2, BM searching algorithm performs at most 3n − n/m symbol comparisons. The bound is tight.

Proof. difficult

Thierry Lecroq Text Searching 60/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Symbol comparisons in variants of BM

For finding all occurrences [Galil, 1979] O(n) −→ prefix memorization O(1) extra space [Apostolico, Giancarlo, 1986] 6 1.5 × n −→ all-suffix memorization O(m) extra space [Crochemore et alii, 1994] 6 2 × n −→ last-suffix memorization O(1) extra space −→ Turbo-BM

Thierry Lecroq Text Searching 61/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-BM method

text y memory- u pos

pattern x u match-shift- u z

Features Stores the last match in case of match shift (memory) Jumps on memory Uses turbo-shifts

Thierry Lecroq Text Searching 62/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-BM method (followed)

Preprocessing same as BM algorithm

Search O(1) extra space to store memory: (length, right position)

Note match-shift is a period of uz

Thierry Lecroq Text Searching 63/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-shift text y  memory - match- σ τ pos pattern x σ σ  - turbo-shift

turbo-shift = |memory| − |match| Use in Turbo-BM: shift ← max(match-shift, occ-shift, turbo-shift); if (shift = match-shift) then set memory; else { shift ← max(shift, match+1); no memory; }

Thierry Lecroq Text Searching 64/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Tight number of comparisons for Turbo-BM

Theorem The Turbo-BM searching algorithm runs in time O(n). It makes no more than 2n symbol comparisons.

Proof. difficult

Thierry Lecroq Text Searching 65/77 Introduction Left-to-right scan — shift Right-to-left scan — shift AG method

previous matches  H PP text y  H PP ) ? HHj PPq

 pattern x current match

Thierry Lecroq Text Searching 66/77 Introduction Left-to-right scan — shift Right-to-left scan — shift AG method

Features Stores all previous matches (suffixes of pattern) Uses the table Suf

Suf [i] = longest suffix of x ending at i in x

pattern x Suf [i] i

Search O(m) extra space to store the matches

Thierry Lecroq Text Searching 67/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Rules for shifts in AG match text y

 pattern x Suf [i] current match i mismatch text y

 pattern x 6= Suf [i] current match i

Thierry Lecroq Text Searching 68/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Rules for shifts in AG mismatch text y 6=  pattern x Suf [i] current match i jump text y

?  pattern x Suf [i] current match i

Thierry Lecroq Text Searching 69/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Tight number of comparisons for AG

Lemma n The AG algorithm makes at most 2 comparisons on text characters previously compared.

Theorem The AG searching algorithm runs in time O(n). It makes no more than 1.5n symbol comparisons.

Proof. rather difficult

Thierry Lecroq Text Searching 70/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Worst-case example

Example aabaaabaabaaabaabaaab... a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b pattern = am−1bamb, text = (am−1bamb)e number of comparisons 3m+1 = 2m + 1 + (3m + 1)e = 2m+1 n − m ≈ 1.5n

Thierry Lecroq Text Searching 71/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References

A. Apostolico and M. Crochemore. Optimal Canonization of All Substrings of a String. Inf. Comput., 95(1):76-95, 1991.

A. Apostolico et Z. Galil, editors. Pattern matching algorithms. Oxford University Press, 1997.

A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.

R. S. Boyer and J. S. Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.

D. Breslauer, L. Colussi, and L. Toniolo. Tight comparison bounds for the string prefix-matching problem. Inf. Process. Lett., 47(1):51–57, 1993.

Thierry Lecroq Text Searching 72/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References

D. Breslauer and Z. Galil. Efficient Comparison Based String Matching. J. Complexity, 9(3):339–365, 1993.

M. Crochemore, A. Czumaj, L. G¸asieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247–267, 1994.

M. Crochemore, L. G¸asieniec, and W. Rytter. Constant-space string-matching in sublinear average time. Theor. Comput. Sci., 218(1):197–203, 1999.

M. Crochemore, C. Hancart and T. Lecroq. Algorithms on Strings. Cambridge University Press, 20017, to appear.

M. Crochemore and T. Lecroq. Tight bounds on the complexity of the Apostolico-Giancarlo algorithm. Inf. Process. Lett., 63(4):195–203, 1997.

Thierry Lecroq Text Searching 73/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References

M. Crochemore and T. Lecroq. Text Searching and Indexing. Recent Advances in Formal Languages and Applications, Series: Studies in Computational Intelligence, Volume 25, (2006), Z. Esik, C. Martin-Vide and V. Mitrana editors, pages 43-80, Springer Verlag. M. Crochemore and D. Perrin. Two-way string-matching. J. Assoc. Comput. Mach., 38(3):651–675, 1991.

M. Crochemore et W. Rytter. Jewels of Stringology. World Scientific, 2002. R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM J. Comput., 23(5):1075–1091, 1994.

R. Cole, R. Hariharan, M. Paterson, and U. Zwick. Tighter lower bounds on the exact complexity of string matching. SIAM J. Comput., 24(1):30–45, 1995. Thierry Lecroq Text Searching 74/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References

Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Commun. ACM, 22(9):505–508, 1979.

Z. Galil and R. Giancarlo. On the exact complexity of string matching: lower bounds. SIAM J. Comput., 20(6):1008–1020, 1991.

Z. Galil and R. Giancarlo. On the Exact Complexity of String Matching: Upper Bounds. SIAM J. Comput., 21(3):407–437, 1992.

Z. Galil and J. Seiferas. Time-space optimal string matching. J. Comput. Syst. Sci., 26(3):280–294, 1983.

L. G¸asieniec, W. Plandowski, and W. Rytter. Constant-space string matching with smaller number of comparisons: sequential sampling.

In Z. Galil and E. Ukkonen,Thierry Lecroq editors, ProceedingsText Searching of the 6th 75/77 Annual Symposium on Combinatorial Pattern Matching, number 937 in Lecture Notes in Computer Science, pages 78–89, Espoo, Finland, 1995. Springer-Verlag, Berlin. L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9(4):672–682, 1980. Introduction Left-to-right scan — shift Right-to-left scan — shift References

L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9(4):672–682, 1980.

D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.

C. Hancart. On Simon’s string searching algorithm. Inf. Process. Lett., 47(2):95–99, 1993.

C. Hancart, 1996. Personal communication. D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(1):323–350, 1977.

Thierry Lecroq Text Searching 76/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References

J. H. Morris, Jr and V. R. Pratt. A linear pattern-matching algorithm. Report 40, University of California, Berkeley, 1970.

I. Simon. Sequence comparison: some theory and some practice. In M. Gross and D. Perrin, editors, Electronic, Dictionaries and Automata in Computational Linguistics, number 377 in Lecture Notes in Computer Science, pages 79–92. Springer-Verlag, Berlin, 1989.

G. A. Stephen. String searching algorithms. World Scientific Press, 1994.

A. C. Yao. The complexity of pattern matching for a random string. SIAM J. Comput., 8(3):368–387, 1979.

U. Zwick and M. S. Paterson. Lower bounds for string matching in the sequential comparison model. Manuscript, 1991.

Thierry Lecroq Text Searching 77/77