Text Searching
Thierry Lecroq [email protected]
Laboratoire d’Informatique, du Traitement de l’Information et des Syst`emes.
International PhD School in Formal Languages and Applications Tarragona, November 20th and 21st, 2006 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan
1 Introduction
2 Left-to-right scan — shift
3 Right-to-left scan — shift
Thierry Lecroq Text Searching 2/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan
1 Introduction
2 Left-to-right scan — shift
3 Right-to-left scan — shift
Thierry Lecroq Text Searching 3/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Pattern matching
text pattern - position of occurrence Problem Find all the occurrences of pattern x of length m inside the text y of length n
Two types of solutions Fixed pattern preprocessing time O(m) Use of combinatorial properties searching time O(n) Static text preprocessing time O(n) Solutions based on indexes searching time O(m)
Thierry Lecroq Text Searching 4/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching for a fixed string
String matching given a pattern x, find all its locations in any text y
Pattern a string x of symbols, of length m t a t a
Text a string y of symbols, of length n c acgtatatatgcgttataat
Thierry Lecroq Text Searching 5/77 Introduction Left-to-right scan — shift Right-to-left scan — shift String matching
Occurrences c acgtatatatgcgttataat
t a t a t a t a
t a t a
Basic operation symbol comparison (=, 6=)
Thierry Lecroq Text Searching 6/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interest
Practical basic problem for search for various patterns lexical analysis approximate string matching comparisons of strings—alignments ...
Theoretical design of algorithms analysis of algorithms—complexity combinatorics on strings ...
Thierry Lecroq Text Searching 7/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sliding window strategy window text y shift - § ¤ scan ...... scan ...... scan pattern x shift -
Scan-and-shift mechanism put window at the beginning of text while window on text do scan: if window = pattern then report it shift: shift window to the right and memorize some information for use during next scans and shifts
Thierry Lecroq Text Searching 8/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive search
Principles No memorization, shift by 1 position
Complexity O(m × n) time, O(1) extra space
Number of symbol comparisons maximum ≈ m × n expected ≈ 2 × n on a two-letter alphabet, with equiprobablity and independence conditions
Thierry Lecroq Text Searching 9/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive string-searching algorithm text y u b 0 pos j n − 1 pattern x u a 0 i m − 1
Algorithm NaiveSearch(string x, y; integer m, n) pos ←− 0 while pos 6 n − m do i ←− 0 while i < m and x[i] = y[pos + i] do i ←− i + 1 if i = m then output(’x occurs in y at position ’, pos) pos ←− pos + 1
Thierry Lecroq Text Searching 10/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of the problem pattern x of length m text y of length n (n > m)
Theorem The search can be done optimally in time O(n) and space O(1). [Galil and Seiferas, 1983]
Theorem log m The search can be done in optimal expected time O( m × n). [Yao, 1979]
Theorem The maximal number of comparisons done during the search is 9 8 > n + 4m (n − m), and can be made 6 n + 3(m+1) (n − m). [Cole et alii, 1995]
Thierry Lecroq Text Searching 11/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Known bounds on symbol comparisons
Lower bounds Upper bounds access to the whole text n [Folklore] 2n − 1 [Morris and Pratt, 1970] 4 3 n [Zwick and Paterson, 1991] search with a sliding window of size m 3 2 n [Apost. and Croch., 1991] 4 4 3 n 3 n [Galil and Giancarlo, 1991] 4 1 3 n − 3 m [Galil and Giancarlo, 1992] 4 log m+2 n + m (n − m) [Breslauer and Galil, 1993] 9 8 n + 4m (n − m) n + 3(m+1) (n − m) [Cole et alii, 1995]
Thierry Lecroq Text Searching 12/77 Introduction Left-to-right scan — shift Right-to-left scan — shift
Known bounds on symbol comparisons (followed)
Lower bounds Upper bounds search with a sliding window of size 1 2n − 1 [Morris and Pratt, 1970] 1 1 (2 − m )n (2 − m )n [Hancart, 1993] [Breslauer et alii, 1993] delay = maximum number of comp’s on each text symbol m [Morris and Pratt, 1970]
logΦ(m + 1) [Knuth, MP, 1977] min(logΦ(m + 1), c) [Simon, 1989] min(1 + log2 m, c) min(1 + log2 m, c) [Hancart, 1993] log min(1 + log2 m, c) [Hancart, 1996]
Thierry Lecroq Text Searching 13/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods
prefix of text in Σ∗pattern
§ pattern ¤
sequential searches (window size = one symbol) ,→ adapted to telecommunications ,→ based on efficient implementations of automata [Knuth, Morris, Pratt, 1976], [Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993]
Thierry Lecroq Text Searching 14/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods (followed) time-space optimal searches ,→ mainly of theoretical interest ,→ based on combinatorial properties of strings [Galil, Seiferas, 1983], [Crochemore, Perrin, 1991], [Crochemore, 1992], [G¸asieniec, Plandowski, Rytter, 1995], [Crochemore, G¸asieniec, Rytter, 1999] practically-fast searches ,→ used in text editors, data retrieval software ,→ based on combinatorics + automata (+ heuristics) [Boyer, Moore, 1977], [Galil, 1979], [Apostolico, Giancarlo, 1986], [Crochemore et alii, 1994]
Thierry Lecroq Text Searching 15/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan
1 Introduction
2 Left-to-right scan — shift
3 Right-to-left scan — shift
Thierry Lecroq Text Searching 16/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Left-to-right scan — shift text y u b
pattern x u a - period(u) border(u) ¦ ¥ Mismatch situation: a 6= b period(u) = |u| − |border(u)| Optimal shift length = period(ub) Valid if u = x
Thierry Lecroq Text Searching 17/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders
Non-empty string u, integer p, 0 < p 6 |u| p is a period of u if any of these equivalent conditions is satisfied: u[i] = u[i + p], for 1 6 i 6 |u| − p u is a prefix of some yk, k > 0, |y| = p u = yw = wz, for some strings y, z, w with |y| = |z| = p String w is called a border of u The period of u, period(u), is its smallest period (can be |u|) The border of u, border(u), is its longest border (can be empty)
Thierry Lecroq Text Searching 18/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders (followed)
Example Periods and borders of abacabacaba 4 abacaba 8 aba 10 a 11 empty string
Thierry Lecroq Text Searching 19/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sequential search
Simple online search Length of shift = period Memorization of borders
Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window period(u) places to right memorize border(u)
Thierry Lecroq Text Searching 20/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm
text y u b 0 pos j n − 1 pattern x u a 0 i m − 1
0 MPnext[i] m − 1
Thierry Lecroq Text Searching 21/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm (followed)
Algorithm MP(string x, y; integer m, n) i ←− 0; j ←− 0 while j < n do while (i = m) or (i > 0 and x[i] 6= y[j]) do i ←− MPnext[i] i ←− i + 1; j ←− j + 1 if i = m then output(’x occurs in y at position’, j − i)
Thierry Lecroq Text Searching 22/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing borders of prefixes
A border of a border of u is a border of u A border of u is either border(u) or a border of it Border[i] = |border(x[0 . . i − 1])| j runs through decreasing lengths of borders
Algorithm ComputeBorders(string x; integer m) Border[0] ←− −1 for i ←− 0 to m − 1 do j ←− Border[i] while j > 0 and x[i] 6= x[j] do j ←− Border[j] Border[i + 1] ←− j + 1 return Border
MPnext[i] = Border[i] for i = 0, . . . , m
Thierry Lecroq Text Searching 23/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement
x u σ x w τ w - p
Interrupted periods — strict borders Changes only the preprocessing of MP algorithm
Thierry Lecroq Text Searching 24/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement (followed)
Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window interruptPperiod(u) places to the right memorize strictBorder(u)
Thierry Lecroq Text Searching 25/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interrupted periods and strict borders
Fixed string x, non-empty prefix u of x w is a strict border of u if both: w is a border of u wτ is a prefix of x, but uτ is not p is an interrupted period of u if p = |u| − |w| for some strict border |w| of u
Example Prefix abacabacaba of abacabacabacc
Interrupted periods and strict borders of abacabacaba 10 a 11 empty string
Thierry Lecroq Text Searching 26/77 Introduction Left-to-right scan — shift Right-to-left scan — shift KMP preprocessing
k = MPnext[i] k, if x[i] 6= x[k] or if i = m, KMPnext[i] = KMPnext[k], if x[i] = x[k].
Algorithm ComputeKMPnext(string x; integer m); KMPnext[0] ←− −1; j ←− 0 for i ←− 1 to m − 1 do if x[i] = x[j] then KMPnext[i] ←− KMPnext[j] else KMPnext[i] ←− j do j ←− KMPnext[j] while j > 0 and x[i] 6= x[j] j ←− j + 1 KMPnext[m] ←− j return KMPnext
Thierry Lecroq Text Searching 27/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Running times of MP and KMP
Theorem On a text of length n, MP and KMP string-searching algorithm run in time O(n). They make less than 2n symbol comparisons.
Proof. Positive comparisons increase the value of j Negative comparisons increase the value of j − i (shift)
Thierry Lecroq Text Searching 28/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Running times of MP and KMP (followed)
Delay = maximum number of comparisons on a text symbol
Theorem Pattern of length m. The delay for MP algorithm is no more than m. The delay for KMP algorithm√ is no more than logΦ(m + 1), where Φ is the golden ratio, (1 + 5)/2.
Proof. For KMP, use the periodicity theorem
A worst-case pattern of length 19: abaababaabaababaaba
Thierry Lecroq Text Searching 29/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periodicities
Theorem (Fine & Wilf 1965) If p and q are periods of a word x and satisfy p + q − GCD(p, q) 6 |x| then GCD(p, q) is a period of x.
Used in the analysis of KMP algorithm and in the analysis of many other pattern matching algorithms.
Theorem (Weak version) If p and q are periods of a word x and satisfy p + q 6 |x| then GCD(p, q) is a period of x.
Proof. If p and q are periods of x, p > q, then p − q is also a period of x.
Thierry Lecroq Text Searching 30/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Proof of the weak statement
p and q periods of x with p + q 6 |x| and p > q p − q period of x because:
p -
a b c - p − q q p -
a bc - q p − q rest of the proof like Euclid’s induction
Thierry Lecroq Text Searching 31/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching with an automaton
Uses the string-matching automaton SMA(x): smallest deterministic automaton accepting Σ∗x
Example x = abaa
Search for abaa in: babbaabaabaabba ··· state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 ···
Thierry Lecroq Text Searching 32/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching algorithm
Simple parsing of the text with the string-matching automaton SMA(x)
Algorithm SEARCH(string x, y; integer m, n) (Q, Σ, initial, {terminal }, δ) is the automaton SMA(x) q ←− initial state if q = terminal then report an occurrence of x in y while not end of y do σ ←− next symbol of y q ←− δ(q, σ) if q = terminal then report an occurrence of x in y
Thierry Lecroq Text Searching 33/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Construction of SMA(x)
Unwinding arcs From SMA(abaa)
to SMA(abaab)
Thierry Lecroq Text Searching 34/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Construction algorithm
Algorithm automaton SMA(string x) let initial be a new state Q ←− {initial } terminal ←− initial for all σ in Σ do δ(initial, σ) ←− initial while not end of x do τ ←− next symbol of x r ←− δ(terminal, τ) add new state s to Q δ(terminal, τ) ←− s for all σ in Σ do δ(s, σ) ←− δ(r, σ) terminal ←− s return (Q, Σ, initial, {terminal }, δ)
Thierry Lecroq Text Searching 35/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Significant arcs
Example Complete SMA(ananas)
a n, s a a 'a $ § § - -a§ - n- a¤ - n- a- s - 0 1 2 3 4 5 6 s n n, s ¦ ¥ s ¦ ¥ n, s & n,% s & % & %
Thierry Lecroq Text Searching 36/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Significant arcs (followed)
Forward arcs: spell the pattern Backward arcs: arcs going backwards without reaching the initial state
a a a 'a $ § - -a§ - n- a¤ - n- a- s - 0 1 2 3 4 5 6 n Lemma ¦ ¥ SMA(x) contains at most |x| backward arcs.
Consequence The implementation of SMA(x) can be done in O(|x|) time and space, independently of the alphabet size
Thierry Lecroq Text Searching 37/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Backward arcs in SMA
States of SMA(x) identified with prefixes of x A backward arc is of the form (u, τ, vτ) (u, v prefixes of x, τ symbol) with vτ is the longest suffix of uτ that is a prefix of x, and u 6= v Note: uτ is not a prefix of x Let p(u, τ) = |u| − |v| (a period of u)
Backward arcs to periods Two backward arcs (u, τ, vτ), (u0, τ 0, v0τ 0) If p(u, τ) = p(u0, τ 0) = p, then vτ = v0τ 0. Otherwise, if for instance vτ is a proper prefix of v0τ 0, vτ occurs at position p like v0 does, uτ is a prefix of x, a contradiction Thus v = v0, τ = τ 0, and then u = u0
Thierry Lecroq Text Searching 38/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Backward arcs in SMA (followed)
Each period p, 1 6 p 6 |x|, corresponds to at most one backward arc, thus there are at most |x| such arcs
A worst case SMA(abm−1) has m backward arcs (a 6= b)
Thierry Lecroq Text Searching 39/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of searching with SMA
Pattern x of length m, text y of length n
With complete SMA implemented by transition matrix Preprocessing on pattern x time O(m × card Σ) space O(m × card Σ) Search on text y time O(n) space O(m × card Σ) Delay time constant
With SMA implemented by lists of backward arcs Preprocessing on pattern x time O(m) space O(m) Search on text y time O(n) space O(m) Delay comparisons min(card Σ, log2 m)
Improves on KMP algorithm
Thierry Lecroq Text Searching 40/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan
1 Introduction
2 Left-to-right scan — shift
3 Right-to-left scan — shift
Thierry Lecroq Text Searching 41/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Right-to-left scan — shift
Example text ··· c g c t c g c g c t a t c g ··· pattern c g c t a g c c g c t a g c c g c t a g c
Match shift: good-suffix rule (function d) Occurrence heuristics: bad-character rule Extra rules if memorization
Thierry Lecroq Text Searching 42/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Right-to-left scan — shift (followed)
Algorithm while window on text do u ←− longest common suffix of window and pattern if u = pattern then report a match shift window d(u) places to the right
Thierry Lecroq Text Searching 43/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Match shift text y τ u pos pattern x σ u shift - i u text y ? τ u pos pattern x σ u shift - i
Precomputation of rightmost occurrences of u’s: O(m) Second shift length = a period of x Table D implements the good-suffix rule: shift = d(u) = D[i]
Thierry Lecroq Text Searching 44/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM algorithm
text y τ u 0 pos j n − 1 pattern x σ u 0 i m − 1
No memorization of previous matches Table D implements the good-suffix rule
Thierry Lecroq Text Searching 45/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM algorithm (followed)
Algorithm BM(string x, y; integer m, n) pos ←− 0 while pos 6 n − m do i ←− m − 1 while i > 0 and x[i] = y[pos + i] do i ←− i − 1 if i = −1 then output(’x occurs in y at position ’, pos) pos ←− pos + D[i + 1]
Thierry Lecroq Text Searching 46/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffix displacement
Displacement function d: d(u) = min{|z| > 0 | (x suffix of uz) or (τuz suffix of x and τu not suffix of x, for τ ∈ Σ)} Displacement table D: D[i] = d(x[i . . m − 1]), for i = 0, .., m − 1 D[m] = 1
τ u z00 x σ u
u z0
Thierry Lecroq Text Searching 47/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffix displacement (followed)
Note 1: u is a (strict) border of uz00 Note 2: |z| is a period of uz (thus, |z0| is a period of x)
Lemma Table D can be computed in linear time.
Thierry Lecroq Text Searching 48/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffixes
Elements for computing the table Suf i given: Suf [j] known for i < j 6 m g = min{f − Suf [f] | i < f. 6 m}
τ 0 u0 τ 0 u0 σ0 u0 τ u σ u pattern x
0 g i f i + m − f − 1 m − 1
If i − Suf [i + m − f − 1] 6= k: Suf [i] = min{Suf [i + m − f − 1], i − g}
Thierry Lecroq Text Searching 49/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Suffixes
τ 0 u0 τ 0 u0 σ0 u0 τ u σ u pattern x
0 g i f i + m− f − 1 m − 1 scan scan If i − Suf [i + m − f − 1] = g: scan to find Suf [i] Yields new g and new f
Thierry Lecroq Text Searching 50/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf
Algorithm ComputeSUF(string x; integer m) Suf [m − 1] ←− m; g ←− m − 1 for i ←− m − 2 downto 0 do if i > g and Suf [i + m − 1 − f] 6= i − g then Suf [i] ←− min{Suf [i + m − f − 1], i − g} else f ←− i; g ←− min(g, i) while g > 0 and x[g] = x[g + m − 1 − f] do g ←− g − 1 Suf [i] ←− f − g return Suf
Thierry Lecroq Text Searching 51/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf
i x a z m − j − 1 -
x c z j - Suf[j]
D[i] = D[m − 1 − Suf [j]] = m − j − 1
Thierry Lecroq Text Searching 52/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing Suf
i x z m − j − 1 -
x j
D[i] = m − j − 1
Thierry Lecroq Text Searching 53/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM preprocessing
Algorithm ComputeD(string x; integer m; table Suf ) i ←− 0 for j ←− m − 2 downto −1 do if j = −1 or Suf [j] = j + 1 then while i > m − 1 − j do D[i] ←− m − 1 − j i ←− i + 1 for j ←− 0 to m − 2 do D[m − 1 − Suf [j]] ←− m − 1 − j return D
Thierry Lecroq Text Searching 54/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Occurrence shift
text y τ u pos pattern x σ u shift - τ - DA[τ]
Table DA implements the bad-character rule: DA[σ] = min({|z| > 0 | σz suffix of x} ∪ {|x|}) shift = DA[τ] − |u| = DA[τ] − m + i
Thierry Lecroq Text Searching 55/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Occurrence shift (followed)
Algorithm ComputeDA(string x; integer m) for all σ in Σ do DA[σ] = m for i ←− 0 to m − 2 do DA[x[i]] = m − i − 1 return DA
Thierry Lecroq Text Searching 56/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM with occurrence shift
text y τ u 0 pos j n − 1 pattern x σ u 0 i m − 1
Use of DA in addition to D
Thierry Lecroq Text Searching 57/77 Introduction Left-to-right scan — shift Right-to-left scan — shift BM with occurrence shift (followed)
Algorithm BM(string x, y; integer m, n); pos ←− 0 while pos 6 n − m do i ←− m − 1 while i > 0 and x[i] = y[pos + i] do i ←− i − 1 if i = −1 then output(’x occurs in y at position ’, pos) pos ←− pos + max(D[i + 1], DA[y[pos + i]] − m + i + 1)
Thierry Lecroq Text Searching 58/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of BM
Preprocessing phase match shift O(m) occurrence shift O(m + card Σ)
Search phase (finding all occurrences) running time O(n × m) minimum number of comparisons n/m maximum number of comparisons n × m
Extra space for shift functions O(m + card Σ) can be reduced to O(m)
Thierry Lecroq Text Searching 59/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Symbol comparisons of BM
For finding the first occurrence [Knuth, Morris, Pratt, 1977] 6 7 × n [Guibas, Odlysko, 1980] 6 4 × n [Cole, 1994] 6 3 × n
Theorem If period(x) > m/2, BM searching algorithm performs at most 3n − n/m symbol comparisons. The bound is tight.
Proof. difficult
Thierry Lecroq Text Searching 60/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Symbol comparisons in variants of BM
For finding all occurrences [Galil, 1979] O(n) −→ prefix memorization O(1) extra space [Apostolico, Giancarlo, 1986] 6 1.5 × n −→ all-suffix memorization O(m) extra space [Crochemore et alii, 1994] 6 2 × n −→ last-suffix memorization O(1) extra space −→ Turbo-BM
Thierry Lecroq Text Searching 61/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-BM method
text y memory- u pos
pattern x u match-shift- u z
Features Stores the last match in case of match shift (memory) Jumps on memory Uses turbo-shifts
Thierry Lecroq Text Searching 62/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-BM method (followed)
Preprocessing same as BM algorithm
Search O(1) extra space to store memory: (length, right position)
Note match-shift is a period of uz
Thierry Lecroq Text Searching 63/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Turbo-shift text y memory - match- σ τ pos pattern x σ σ - turbo-shift
turbo-shift = |memory| − |match| Use in Turbo-BM: shift ← max(match-shift, occ-shift, turbo-shift); if (shift = match-shift) then set memory; else { shift ← max(shift, match+1); no memory; }
Thierry Lecroq Text Searching 64/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Tight number of comparisons for Turbo-BM
Theorem The Turbo-BM searching algorithm runs in time O(n). It makes no more than 2n symbol comparisons.
Proof. difficult
Thierry Lecroq Text Searching 65/77 Introduction Left-to-right scan — shift Right-to-left scan — shift AG method
previous matches H PP text y H PP ) ? HHj PPq
pattern x current match
Thierry Lecroq Text Searching 66/77 Introduction Left-to-right scan — shift Right-to-left scan — shift AG method
Features Stores all previous matches (suffixes of pattern) Uses the table Suf
Suf [i] = longest suffix of x ending at i in x
pattern x Suf [i] i
Search O(m) extra space to store the matches
Thierry Lecroq Text Searching 67/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Rules for shifts in AG match text y
pattern x Suf [i] current match i mismatch text y
pattern x 6= Suf [i] current match i
Thierry Lecroq Text Searching 68/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Rules for shifts in AG mismatch text y 6= pattern x Suf [i] current match i jump text y
? pattern x Suf [i] current match i
Thierry Lecroq Text Searching 69/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Tight number of comparisons for AG
Lemma n The AG algorithm makes at most 2 comparisons on text characters previously compared.
Theorem The AG searching algorithm runs in time O(n). It makes no more than 1.5n symbol comparisons.
Proof. rather difficult
Thierry Lecroq Text Searching 70/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Worst-case example
Example aabaaabaabaaabaabaaab... a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b pattern = am−1bamb, text = (am−1bamb)e number of comparisons 3m+1 = 2m + 1 + (3m + 1)e = 2m+1 n − m ≈ 1.5n
Thierry Lecroq Text Searching 71/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References
A. Apostolico and M. Crochemore. Optimal Canonization of All Substrings of a String. Inf. Comput., 95(1):76-95, 1991.
A. Apostolico et Z. Galil, editors. Pattern matching algorithms. Oxford University Press, 1997.
A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.
R. S. Boyer and J. S. Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.
D. Breslauer, L. Colussi, and L. Toniolo. Tight comparison bounds for the string prefix-matching problem. Inf. Process. Lett., 47(1):51–57, 1993.
Thierry Lecroq Text Searching 72/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References
D. Breslauer and Z. Galil. Efficient Comparison Based String Matching. J. Complexity, 9(3):339–365, 1993.
M. Crochemore, A. Czumaj, L. G¸asieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247–267, 1994.
M. Crochemore, L. G¸asieniec, and W. Rytter. Constant-space string-matching in sublinear average time. Theor. Comput. Sci., 218(1):197–203, 1999.
M. Crochemore, C. Hancart and T. Lecroq. Algorithms on Strings. Cambridge University Press, 20017, to appear.
M. Crochemore and T. Lecroq. Tight bounds on the complexity of the Apostolico-Giancarlo algorithm. Inf. Process. Lett., 63(4):195–203, 1997.
Thierry Lecroq Text Searching 73/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References
M. Crochemore and T. Lecroq. Text Searching and Indexing. Recent Advances in Formal Languages and Applications, Series: Studies in Computational Intelligence, Volume 25, (2006), Z. Esik, C. Martin-Vide and V. Mitrana editors, pages 43-80, Springer Verlag. M. Crochemore and D. Perrin. Two-way string-matching. J. Assoc. Comput. Mach., 38(3):651–675, 1991.
M. Crochemore et W. Rytter. Jewels of Stringology. World Scientific, 2002. R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM J. Comput., 23(5):1075–1091, 1994.
R. Cole, R. Hariharan, M. Paterson, and U. Zwick. Tighter lower bounds on the exact complexity of string matching. SIAM J. Comput., 24(1):30–45, 1995. Thierry Lecroq Text Searching 74/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References
Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Commun. ACM, 22(9):505–508, 1979.
Z. Galil and R. Giancarlo. On the exact complexity of string matching: lower bounds. SIAM J. Comput., 20(6):1008–1020, 1991.
Z. Galil and R. Giancarlo. On the Exact Complexity of String Matching: Upper Bounds. SIAM J. Comput., 21(3):407–437, 1992.
Z. Galil and J. Seiferas. Time-space optimal string matching. J. Comput. Syst. Sci., 26(3):280–294, 1983.
L. G¸asieniec, W. Plandowski, and W. Rytter. Constant-space string matching with smaller number of comparisons: sequential sampling.
In Z. Galil and E. Ukkonen,Thierry Lecroq editors, ProceedingsText Searching of the 6th 75/77 Annual Symposium on Combinatorial Pattern Matching, number 937 in Lecture Notes in Computer Science, pages 78–89, Espoo, Finland, 1995. Springer-Verlag, Berlin. L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9(4):672–682, 1980. Introduction Left-to-right scan — shift Right-to-left scan — shift References
L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9(4):672–682, 1980.
D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
C. Hancart. On Simon’s string searching algorithm. Inf. Process. Lett., 47(2):95–99, 1993.
C. Hancart, 1996. Personal communication. D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(1):323–350, 1977.
Thierry Lecroq Text Searching 76/77 Introduction Left-to-right scan — shift Right-to-left scan — shift References
J. H. Morris, Jr and V. R. Pratt. A linear pattern-matching algorithm. Report 40, University of California, Berkeley, 1970.
I. Simon. Sequence comparison: some theory and some practice. In M. Gross and D. Perrin, editors, Electronic, Dictionaries and Automata in Computational Linguistics, number 377 in Lecture Notes in Computer Science, pages 79–92. Springer-Verlag, Berlin, 1989.
G. A. Stephen. String searching algorithms. World Scientific Press, 1994.
A. C. Yao. The complexity of pattern matching for a random string. SIAM J. Comput., 8(3):368–387, 1979.
U. Zwick and M. S. Paterson. Lower bounds for string matching in the sequential comparison model. Manuscript, 1991.
Thierry Lecroq Text Searching 77/77