Boyer-Moore Strategy to Efficient Approximate String Matching

Boyer-Moore strategy to eﬀicient approximate string matching Nadia El Mabrouk, Maxime Crochemore To cite this version: Nadia El Mabrouk, Maxime Crochemore. Boyer-Moore strategy to eﬀicient approximate string matching. Combinatorial Pattern Matching (Labuna Beach, California, 1996), 1996, France. pp.24-38. hal-00620020 HAL Id: hal-00620020 https://hal-upec-upem.archives-ouvertes.fr/hal-00620020 Submitted on 26 Mar 2013 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. BoyerMoore strategy to ecient approximate string matching Nadia ElMabrouk and Maxime Cro chemore IGM Universite Marne la Vallee rue de la Butte Verte Noisy Le Grand Cedex Abstract We prop ose a simple but ecient algorithm for searching all o ccurrences of a pattern or a class of patterns length m in a text length n with at most k mismatches This algorithm relies on the ShiftAdd algorithm of BaezaYates and Gonnet which involves representing by a bit number the current state of the search and uses the ability of programming languages to handle bit words State representation should not therefore exceeds the word size that is mdlog k e This algorithm consists 2 in a prepro cessing step and a searching step It is linear and p erforms n op erations during the searching step Notions of shift and character skip found in the BoyerMoore BM approach are introduced in this algorithm Provided that the considered alphab et is large enough compared to the Pattern length the average number of op erations p erformed by our algorithm during the searching k +4 step b ecomes n mk Introduction Our purp ose is approximate matching of a pattern or a class of patterns in a text all sequences of characters or classes of characters from a nite alphab et Er rors considered here are mismatches A class of patterns is a set of patterns with dont care symbols patterns containing the complementary of a character or any other class of characters Such a problem has a lot of applications in particular in molecular biology for predicting p otential nuclear geneco ding sequences in genomic DNA sequences In fact exact string matching is not sucient since geneco ding sequences are in general only partially and approximately sp ecied Concerning exact string matching algorithms based on the BoyerMoore BM approach are the fastest in practice Such algorithms are linear and may even have a sublinear b ehaviour in the sense that every character in the text need not b e checked In certain cases text characters can b e skipp ed without missing a pattern o ccurrence The larger the alphab et and the longer the pattern the faster the algorithm works Various algorithms have b een developed for searching with k mismatches all o ccurrences of a pattern length m in a text length n b oth dened over an alphab et length c Running times have ranged from O mn for the naive algorithm to O k n or O n log m The rst two algorithms consist in a prepro cessing step and a searching step Grossi and Luccio algorithm uses the sux tree Other algorithms have used the BM approach in approximate string matching Running times are O k n for BaezaYates and Gonnet k 1 for Tarhio and Ukkonen The problem of approximate and O k n mk c matching of a class of patterns was also studied esp ecially in the case of patterns with dont care symbols Fisher et Paterson 2 developed an O n log c log m log log m time algorithm based on the linear pro duct Abrahamson extended this metho d for generalized string pattern Pinter has used the Aho and Corasick automaton for searching a set of patterns Other algorithms have considered the problem of exact matching of patterns with variable length dont cares As for Akutsu he p 2 m m developed an O log log time algorithm for searching a k m n log c log k k pattern with dont cares in a text with dont cares In several new algorithms for approximate string matching were pub lished They combine b oth sp eed and programming practicality in contrast with older results most of which b eing mainly of theoretical interest Moreover they are exible enough to allow searching for a class of patterns These algorithms consist in a pattern prepro cessing step and a searching step They are all based on the same approach consisting in nding at a given p osi tion in the text all approximate pattern prexes ending at this p osition Sp eed is increased by representing the state of the search as a bit number or an array and by using the ability of programming languages to handle bit words Nevertheless these algorithms are based on a naive approach and pro cess each character of the text Our goal is to sp eed up searching by using a BM strategy and including notions of shift and character skip We have chosen to consider such an improvement in the case of the Shift Add algorithm of BaezaYates and Gonnet The main idea of ShiftAdd is to represent the state of the search as a bit number and p erform a few simple arithmetic and logical op erations Provided that representations dont exceed the word size that is mdlog k e each search step do es exactly 2 a shift a test and an addition Therefore this algorithm runs in O n time and the searching step do es n op erations We developed an algorithm combining the practicality of the ShiftAdd metho d and the sp eed of the BM approach Provided that the considered alphab et is large enough compared to m our new algorithm k +4 op erations during the searching step p erforms on average n mk The pap er is organized as follows Section summarises the algorithm Shift Add in the case of exact or approximate matching of a pattern or a class of patterns Section develops the adaptation of the BM approach to the Shift Add metho d An improvement of this last algorithm is given in Section Finally section gives exp erimental results obtained with b oth algorithms ShiftAdd Algorithm Let P p p b e a pattern and t t t b e a text over a nite alphab et 1 m 1 n The problem is to nd in t all o ccurrences of P with at most k mismatches k m In other words the distance b etween two patterns of the same length will b e dened as the number of their mismatching characters the Hamming distance An equivalent problem is then to nd in t all substrings of length m such that the Hamming distance b etween these substrings and P is at most k that is to nd all j p ositions in the text such that for i m p t i j m+i except for at most k indices The main idea is to represent the state of the search as a vector of size m Thus S denotes the state vector given a current p osition j in the text j S contains individual states of the search b etween each prex of P and the j corresp onding substring of t Namely for i m S i is the number of j mismatches b etween p p and t t 1 i j i+1 j P matches at j if and only if S m k j When t is read the number of mismatches for each prex of P needs to j +1 b e completed Values of b o olean expressions t p for i m can b e j +1 i computed during a prepro cessing step For each character a in a vector T of a size m is constructed such that if a p i For i i m T i (1) a otherwise it is sucient to construct the T arrays only for characters app earing in the pattern i Finally S i S i T j +1 j t j +1 In order to obtain S from S by simple arithmetic and logical op erations j +1 j b vectors are considered as numbers and represented in base where b is the bit number needed to represent each vector comp onent m m X X (i1)b (i1)b T i S i and T Thus S a j a j i=1 i=1 Representations should not exceed the word size namely mb It is easy now to verify that the transition from S to S amounts to no j j +1 more than a left shift denoted by of b bits and an addition (2) S S b T j +1 j t j +1 (m1)b Initial state is S P matches at j if and only if S k 0 j Possible values of the vector state comp onents are m Thus to rep resent each comp onent b dlog m e bits are required However since we 2 need only to compare the number of mismatches with k it is enough to represent values from to k In this case one more bit is needed for carrying over addi tions The improved algorithm uses b dlog k e bits At each p osition 2 j in the text the overow bits are recorded in an overow state R and the j overow bits of S are reset j The ShiftAdd algorithm works in O n time and the searching step disre garding the overow state p erforms n op erations In fact at each step that is for each p osition j three op erations are p erformed one shift one addition and one test to determine whether P matches at p osition j Exact string matching In the case of exact string matching it is only necessary to know whether a given prex of P matches exactly the considered substring of t We dene S as j follows if p p t t 1 i j i+1 j For i m S i j otherwise When t is read we need to determine whether t can extend any of the j +1 j +1 partial matches Thus in order to have a match of p p at p osition j 1 i b oth S i and t p should b e satised Here b and in formula j j +1 i the symbol should

Load more