Text Searching

Text Searching

Text Searching Thierry Lecroq [email protected] Laboratoire d’Informatique, du Traitement de l’Information et des Syst`emes. International PhD School in Formal Languages and Applications Tarragona, November 20th and 21st, 2006 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan 1 Introduction 2 Left-to-right scan — shift 3 Right-to-left scan — shift Thierry Lecroq Text Searching 2/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan 1 Introduction 2 Left-to-right scan — shift 3 Right-to-left scan — shift Thierry Lecroq Text Searching 3/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Pattern matching text pattern - position of occurrence Problem Find all the occurrences of pattern x of length m inside the text y of length n Two types of solutions Fixed pattern preprocessing time O(m) Use of combinatorial properties searching time O(n) Static text preprocessing time O(n) Solutions based on indexes searching time O(m) Thierry Lecroq Text Searching 4/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Searching for a fixed string String matching given a pattern x, find all its locations in any text y Pattern a string x of symbols, of length m t a t a Text a string y of symbols, of length n c acgtatatatgcgttataat Thierry Lecroq Text Searching 5/77 Introduction Left-to-right scan — shift Right-to-left scan — shift String matching Occurrences c acgtatatatgcgttataat t a t a t a t a t a t a Basic operation symbol comparison (=, 6=) Thierry Lecroq Text Searching 6/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interest Practical basic problem for search for various patterns lexical analysis approximate string matching comparisons of strings—alignments ... Theoretical design of algorithms analysis of algorithms—complexity combinatorics on strings ... Thierry Lecroq Text Searching 7/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sliding window strategy window text y shift - § ¤ scan . scan . scan pattern x shift - Scan-and-shift mechanism put window at the beginning of text while window on text do scan: if window = pattern then report it shift: shift window to the right and memorize some information for use during next scans and shifts Thierry Lecroq Text Searching 8/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive search Principles No memorization, shift by 1 position Complexity O(m × n) time, O(1) extra space Number of symbol comparisons maximum ≈ m × n expected ≈ 2 × n on a two-letter alphabet, with equiprobablity and independence conditions Thierry Lecroq Text Searching 9/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Naive string-searching algorithm text y u b 0 pos j n − 1 pattern x u a 0 i m − 1 Algorithm NaiveSearch(string x, y; integer m, n) pos ←− 0 while pos 6 n − m do i ←− 0 while i < m and x[i] = y[pos + i] do i ←− i + 1 if i = m then output(’x occurs in y at position ’, pos) pos ←− pos + 1 Thierry Lecroq Text Searching 10/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Complexity of the problem pattern x of length m text y of length n (n > m) Theorem The search can be done optimally in time O(n) and space O(1). [Galil and Seiferas, 1983] Theorem log m The search can be done in optimal expected time O( m × n). [Yao, 1979] Theorem The maximal number of comparisons done during the search is 9 8 > n + 4m (n − m), and can be made 6 n + 3(m+1) (n − m). [Cole et alii, 1995] Thierry Lecroq Text Searching 11/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Known bounds on symbol comparisons Lower bounds Upper bounds access to the whole text n [Folklore] 2n − 1 [Morris and Pratt, 1970] 4 3 n [Zwick and Paterson, 1991] search with a sliding window of size m 3 2 n [Apost. and Croch., 1991] 4 4 3 n 3 n [Galil and Giancarlo, 1991] 4 1 3 n − 3 m [Galil and Giancarlo, 1992] 4 log m+2 n + m (n − m) [Breslauer and Galil, 1993] 9 8 n + 4m (n − m) n + 3(m+1) (n − m) [Cole et alii, 1995] Thierry Lecroq Text Searching 12/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Known bounds on symbol comparisons (followed) Lower bounds Upper bounds search with a sliding window of size 1 2n − 1 [Morris and Pratt, 1970] 1 1 (2 − m )n (2 − m )n [Hancart, 1993] [Breslauer et alii, 1993] delay = maximum number of comp’s on each text symbol m [Morris and Pratt, 1970] logΦ(m + 1) [Knuth, MP, 1977] min(logΦ(m + 1), c) [Simon, 1989] min(1 + log2 m, c) min(1 + log2 m, c) [Hancart, 1993] log min(1 + log2 m, c) [Hancart, 1996] Thierry Lecroq Text Searching 13/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods prefix of text in Σ∗pattern § pattern ¤ sequential searches (window size = one symbol) ,→ adapted to telecommunications ,→ based on efficient implementations of automata [Knuth, Morris, Pratt, 1976], [Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993] Thierry Lecroq Text Searching 14/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Methods (followed) time-space optimal searches ,→ mainly of theoretical interest ,→ based on combinatorial properties of strings [Galil, Seiferas, 1983], [Crochemore, Perrin, 1991], [Crochemore, 1992], [G¸asieniec, Plandowski, Rytter, 1995], [Crochemore, G¸asieniec, Rytter, 1999] practically-fast searches ,→ used in text editors, data retrieval software ,→ based on combinatorics + automata (+ heuristics) [Boyer, Moore, 1977], [Galil, 1979], [Apostolico, Giancarlo, 1986], [Crochemore et alii, 1994] Thierry Lecroq Text Searching 15/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Plan 1 Introduction 2 Left-to-right scan — shift 3 Right-to-left scan — shift Thierry Lecroq Text Searching 16/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Left-to-right scan — shift text y u b pattern x u a - period(u) border(u) ¦ ¥ Mismatch situation: a 6= b period(u) = |u| − |border(u)| Optimal shift length = period(ub) Valid if u = x Thierry Lecroq Text Searching 17/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders Non-empty string u, integer p, 0 < p 6 |u| p is a period of u if any of these equivalent conditions is satisfied: u[i] = u[i + p], for 1 6 i 6 |u| − p u is a prefix of some yk, k > 0, |y| = p u = yw = wz, for some strings y, z, w with |y| = |z| = p String w is called a border of u The period of u, period(u), is its smallest period (can be |u|) The border of u, border(u), is its longest border (can be empty) Thierry Lecroq Text Searching 18/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Periods and borders (followed) Example Periods and borders of abacabacaba 4 abacaba 8 aba 10 a 11 empty string Thierry Lecroq Text Searching 19/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Sequential search Simple online search Length of shift = period Memorization of borders Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window period(u) places to right memorize border(u) Thierry Lecroq Text Searching 20/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm text y u b 0 pos j n − 1 pattern x u a 0 i m − 1 0 MPnext[i] m − 1 Thierry Lecroq Text Searching 21/77 Introduction Left-to-right scan — shift Right-to-left scan — shift MP algorithm (followed) Algorithm MP(string x, y; integer m, n) i ←− 0; j ←− 0 while j < n do while (i = m) or (i > 0 and x[i] 6= y[j]) do i ←− MPnext[i] i ←− i + 1; j ←− j + 1 if i = m then output(’x occurs in y at position’, j − i) Thierry Lecroq Text Searching 22/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Computing borders of prefixes A border of a border of u is a border of u A border of u is either border(u) or a border of it Border[i] = |border(x[0 . i − 1])| j runs through decreasing lengths of borders Algorithm ComputeBorders(string x; integer m) Border[0] ←− −1 for i ←− 0 to m − 1 do j ←− Border[i] while j > 0 and x[i] 6= x[j] do j ←− Border[j] Border[i + 1] ←− j + 1 return Border MPnext[i] = Border[i] for i = 0, . , m Thierry Lecroq Text Searching 23/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement x u σ x w τ w - p Interrupted periods — strict borders Changes only the preprocessing of MP algorithm Thierry Lecroq Text Searching 24/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Improvement (followed) Algorithm while window on text do u ←− longest common prefix of window and pattern if u = pattern then report a match shift window interruptPperiod(u) places to the right memorize strictBorder(u) Thierry Lecroq Text Searching 25/77 Introduction Left-to-right scan — shift Right-to-left scan — shift Interrupted periods and strict borders Fixed string x, non-empty prefix u of x w is a strict border of u if both: w is a border of u wτ is a prefix of x, but uτ is not p is an interrupted period of u if p = |u| − |w| for some strict border |w| of u Example Prefix abacabacaba of abacabacabacc Interrupted periods and strict borders of abacabacaba 10 a 11 empty string Thierry Lecroq Text Searching 26/77 Introduction Left-to-right scan — shift Right-to-left scan — shift KMP preprocessing k = MPnext[i] k, if x[i] 6= x[k] or if i = m, KMPnext[i] = KMPnext[k], if x[i] = x[k].

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    77 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us